The research described in this report is presented in six parts.

Part  I:  On  Interprocess  Communication  studies  interprocess  communica¬ 
tion  without  assuming  any  lower-level  communication  primitives.  A 
formalism  is  developed  for  reasoning  about  concurrent  systems  that 
does  not  assume  an  atomic  grain  of  action. 

Part  II:  The  Intersecting  Broadcast  Machine  is  a  novel  array  processor  ar¬ 
chitecture,  capable  of  processing  efficiently  programs  whose  arbitrary 
or  complex  structure  would  make  them  difficult  to  map  onto  conven¬ 
tional  array  processors.  The  architecture  also  supports  fault-tolerant 
operation. 

Part  III:  Broadcast  Protocols  for  Distributed  Systems  considers  how  the 
broadcast  character  of  communications  media  such  as  Ethernet  and 
packet radio can be exploited to yield reliable communication with
very little overhead.

Part IV: Extending Interval Logic to Real Time Systems presents a
technique for the formal expression of the real-time constraints that are
critical to the specification of fault-tolerant distributed systems.

Part  V:  Consistency  of  Replicated  Information  in  Multichannel  Fault  Tol¬ 
erant  Systems  considers  the  possibility  of  using  similar,  but  not  identi¬ 
cal,  processing  in  the  replicas  of  a  fault  tolerant  system.  Conventional 
fault  tolerant  systems  using  replicated  processing  require  the  replicas 
to  be  identical,  so  that  they  can  be  compared  by  exact  match  algo¬ 
rithms.  This  exact  replication  increases  the  risk  that  a  common  fault 
will  affect  all  replicas  and  cause  system  failure. 

Part VI: Experimental Implementation and Evaluation of the TRANS
Broadcast  Protocol  describes  an  implementation  and  evaluation  of  the 
broadcast  protocol  outlined  in  Part  III. 

1.1 On Interprocess Communication

All  communication  ultimately  involves  a  communication  medium  whose 
state  is  changed  by  the  sender  and  observed  by  the  receiver.  A  sending 
processor  changes  the  voltage  on  a  wire  and  a  receiving  processor  observes 
the  voltage  change;  a  speaker  changes  the  vibrational  state  of  the  air  and 
a  listener  senses  this  change. 

Communication  acts  can  be  divided  into  two  classes:  transient  and  per¬ 
sistent.  In  a  transient  communication,  the  medium’s  state  is  changed  only 
for the duration of the communication, immediately afterwards reverting to
its “normal” state. A message sent on an Ethernet modifies the transmission
medium’s  state  only  while  the  message  is  in  transit;  the  altered  state  of  the 
air  lasts  only  while  the  speaker  is  talking.  In  a  persistent  communication, 
the  state  change  remains  after  the  sender  has  finished  its  communication. 
Setting  a  voltage  level  on  a  wire,  writing  on  a  blackboard,  and  raising  a 
flag on a flagpole are all examples of persistent communication.

Transient  communication  is  possible  only  if  the  receiver  is  observing  the 
communication  medium  while  the  sender  is  modifying  it.  This  implies  an  a 
priori  synchronization — the  receiver  must  be  waiting  for  the  communication 
to  take  place.  Communication  between  truly  asynchronous  processes  must 
be  persistent,  the  sender  changing  the  state  of  the  medium  and  the  receiver 
able  to  sense  that  change  at  a  later  time. 

Message  passing  is  often  considered  to  be  a  form  of  transient  communi¬ 
cation  between  asynchronous  processes.  However,  a  closer  examination  of 
asynchronous  message  passing  reveals  that  it  involves  a  persistent  commu¬ 
nication.  Messages  are  placed  in  a  buffer  that  is  periodically  tested  by  the 
receiver.  Viewed  at  a  low  level,  message  passing  is  typically  accomplished 
by  putting  a  message  in  a  buffer  and  setting  an  interrupt  bit  that  is  tested 
on  every  machine  instruction.  The  receiving  process  actually  consists  of 
two asynchronous subprocesses: a main process that is usually thought of
as  the  receiver,  and  an  input  process  that  continuously  monitors  the  com¬ 
munication  medium  and  puts  messages  in  the  buffer.  The  input  process  is 
synchronized  with  the  sender  (it  is  a  “slave”  process)  and  communicates 
asynchronously  with  the  main  process  using  the  buffer  as  a  medium  for 
persistent  communication. 

The subject of this report is asynchronous interprocess communication,
so  only  persistent  communication  is  considered.  Moreover,  we  will  restrict 
ourselves  to  unidirectional  communication,  in  which  only  a  single  process 
can modify the state of the medium. With this restriction, two-way
communication requires at least two separate communication media, one modified
by  each  process.  However,  multiple  receivers  will  be  considered.  We  also 
restrict  our  attention  to  discrete  systems,  in  which  the  medium  has  a  finite 
number  of  distinguishable  states.  The  sender  can  therefore  set  the  medium 
to  one  of  a  fixed  number  of  persistent  states,  and  the  receiver(s)  can  observe 
the  medium’s  state. 

The  form  of  persistent  communication  that  we  have  described  is  more 
commonly  known  as  a  shared  register,  where  the  sender  and  receiver  are 
called the writer and reader, respectively, and the state of the
communication medium is known as the value of the register. We will use
these terms in the rest of this paper, so we will consider finite-valued
registers with a single writer and one or more readers.

While  the  practical  applications  of  the  algorithms  described  in  this  pa¬ 
per  will  be  to  “small”  registers,  the  larger  purpose  is  to  develop  insight 
into,  and  formal  methods  for  reasoning  about,  nonatomic  operations  to 
data  objects.  In  the  realm  of  conventional  database  theory,  atomicity  is 
usually  called  “serializability”.  Moreover,  although  the  notation  used  in 
describing  the  algorithms  suggests  a  shared-memory  implementation,  these 
are  really  distributed  algorithms,  since  each  shared  register  is  modified  by 
only  a  single  process.  Thus,  the  results  described  here  can  be  regarded  as 
a  preliminary  investigation  of  nonserializable  operations  in  a  distributed 
database. 

In  assuming  a  single  writer,  we  rule  out  the  possibility  of  concurrent 
writes  (to  the  same  register).  Since  a  reader  only  senses  the  value,  there  is 
no  reason  why  a  read  operation  must  interfere  with  another  read  or  write 
operation.  (While  reads  do  interfere  with  other  operations  in  some  forms 
of memory, such as magnetic core, this interference is an idiosyncrasy of
the  particular  technology  rather  than  an  inherent  property  of  reading.)  We 
therefore  assume  that  a  read  does  not  affect  any  other  read  or  any  write. 
However,  it  is  not  clear  what  effect  a  concurrent  write  should  have  on  a 
read. 

In concurrent programming, one traditionally assumes that a writer has
exclusive access to shared data, making concurrent reading and writing
impossible. This assumption is enforced either by requiring the programming
language to provide the necessary exclusive access, or by implementing
the exclusion with a “readers-writers” protocol [3]. Such an approach
requires that a reader must wait while a writer is accessing the register,
and  vice-versa.  Moreover,  any  method  for  achieving  such  exclusive  access, 
whether  implemented  by  the  programmer  or  the  compiler,  requires  a  lower- 
level  shared  register.  At  some  level,  the  problem  of  concurrent  access  to  a 
shared  register  must  be  faced.  It  is  this  problem  that  will  be  addressed,  so 
we  eschew  any  approach  that  requires  one  process  to  wait  for  another. 

Asynchronous  concurrent  access  to  shared  registers  is  usually  considered 
only  at  the  hardware  level,  so  it  is  at  this  level  that  the  methods  developed 
here  could  have  some  direct  application.  However,  concurrent  access  to 
shared  data  occurs  at  high  levels  of  abstraction.  One  cannot  allow  any  single 
process  exclusive  access  to  the  entire  social  security  system’s  database. 
While  algorithms  for  implementing  a  single  register  cannot  be  applied  to 
such  a  database,  we  hope  that  the  formalism  developed  for  analyzing  these 
algorithms  will  eventually  prove  useful  for  analyzing  concurrent  systems  at 
a higher level. Nevertheless, it is probably best to think of a register as
a  low-level  component,  probably  implemented  in  hardware,  when  reading 
this  paper. 

Hardware  implementations  of  asynchronous  communication  often  make 
assumptions  about  the  relative  speeds  of  the  communicating  processes. 
Such  assumptions  can  lead  to  simplifications.  For  example,  the  problem 
of  constructing  an  atomic  register,  discussed  below,  is  shown  to  be  easily 
solved  by  assuming  that  two  successive  reads  of  a  register  cannot  be  concur¬ 
rent  with  a  single  write.  If  one  knows  how  long  a  write  can  take,  a  delay  can 
be  added  between  successive  reads  to  ensure  that  this  assumption  holds. 
We  make  no  such  assumptions  about  process  speeds.  The  results  therefore 
apply  even  to  communication  between  processes  of  vastly  differing  speeds. 

We  therefore  make  no  assumptions  about  relative  process  speed  and 
consider  a  shared  register  in  which  a  read  can  overlap  (be  concurrent  with) 
a  write.  Three  possible  assumptions  about  what  can  happen  when  a  read 
overlaps  one  or  more  writes  are  considered. 

The weakest possibility is a safe register, in which the only assumption
made  about  the  value  obtained  by  a  read  that  overlaps  a  write  is  that  the 
read  obtain  one  of  the  possible  values  of  the  register — for  example,  a  read 
of  a  boolean-valued  register  must  obtain  either  true  or  false.  A  read  that 
is  not  concurrent  with  a  write  is  assumed  to  obtain  the  correct  value — that 
is,  the  most  recently  written  one.  However,  a  read  that  overlaps  a  write 
may  return  any  possible  value. 

The  next  stronger  possibility  is  a  regular  register,  which  is  safe  (a  read 
not  concurrent  with  a  write  gets  the  correct  value)  and  in  which  a  read 
that  overlaps  a  write  obtains  either  the  old  or  new  value.  More  generally, 
a  read  that  overlaps  any  series  of  writes  obtains  either  the  value  before  the 
first  of  the  writes  or  one  of  the  values  being  written. 
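
As a rough illustration of the safe and regular conditions, the following Python sketch computes the set of values that a single read is allowed to return. The function names, the `domain` argument, and the representation of overlapping writes as a list of written values are assumptions made for this example; they are not notation from the report.

```python
def safe_read_values(domain, last_written, overlapping_writes):
    """Values a read of a safe register may return."""
    if not overlapping_writes:
        # No concurrent write: the read must see the most recently written value.
        return {last_written}
    # Any concurrent write: every value in the register's domain is allowed.
    return set(domain)


def regular_read_values(last_written, overlapping_writes):
    """Values a read of a regular register may return."""
    # Either the value before the first overlapping write,
    # or one of the values being written.
    return {last_written, *overlapping_writes}
```

For a four-valued register that holds 0 while a write of 3 is in progress, the safe model allows any of 0 to 3, while the regular model allows only 0 or 3.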

The  final  possibility  is  an  atomic  register,  which  is  safe  and  in  which 
reads  and  writes  behave  as  if  they  occurred  in  some  definite  order.  In 
other  words,  for  any  execution  of  the  system,  there  is  some  way  of  totally 
ordering  the  reads  and  writes  so  that  the  values  returned  by  the  reads  are 
the  same  as  if  the  operations  had  been  performed  in  that  order,  with  no 
overlapping.  (It  is  also  required  that  this  ordering  should  be  a  reasonable 
one;  the  precise  condition  is  stated  below.) 

A  regular  register  is  obviously  stronger  than  a  safe  one,  since  it  places 
a  condition  on  the  value  returned  by  a  read  that  overlaps  a  write.  An 
atomic  register  is  stronger  than  a  regular  one  because,  if  two  successive 
reads  overlap  the  same  write,  then  a  regular  register  allows  the  first  read  to 
obtain  the  new  value  and  the  second  read  the  old  value.  This  is  forbidden 
in  an  atomic  register,  in  which  the  only  allowed  possibilities  are  old-old, 
new-new,  and  old-new.  In  fact,  it  will  be  shown  that  a  regular  register  is 
atomic  if  and  only  if  two  successive  reads  that  overlap  the  same  write  cannot 
obtain  the  new  then  the  old  value.  Thus,  a  regular  register  is  automatically 
an  atomic  one  if  two  successive  reads  cannot  overlap  the  same  write. 
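
The difference between regular and atomic registers for two successive reads that overlap the same write can be enumerated directly. This is only a sketch under the assumption that each outcome is represented as a pair (first read, second read); the function names are hypothetical.

```python
from itertools import product


def regular_pairs(old, new):
    """Outcomes allowed by a regular register: each read sees old or new."""
    return set(product((old, new), repeat=2))


def atomic_pairs(old, new):
    """Outcomes allowed by an atomic register: the new-old pair is forbidden."""
    return regular_pairs(old, new) - {(new, old)}
```

The only pair removed is the one in which the earlier read returns the new value and the later read returns the old one.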

These  are  the  only  three  general  classes  of  register  that  we  have  been 
able  to  think  of.  Each  class  merits  study.  Safety  seems  to  be  the  weakest 
requirement  that  allows  useful  communication;  we  do  not  know  how  to 
achieve  any  form  of  interprocess  synchronization  with  a  weaker  assumption. 
Regularity  asserts  that  a  read  returns  a  “reasonable”  value,  and  seems 
to be a natural requirement. Atomicity is the most common assumption
made about shared registers, and is provided by current multiport computer
memories.¹ At a lower level, such as interprocess communication within a
single  chip,  only  safe  registers  are  provided;  other  classes  of  register  must 
be  implemented  using  safe  ones. 

Any  method  of  implementing  a  single-writer  register  can  be  classified 
by  three  “coordinates”  with  the  following  values: 

•  safe,  regular,  or  atomic,  according  to  the  strongest  assumption  that 
the  register  satisfies. 

• boolean or multivalued, according to whether the method produces
only  boolean  registers  or  registers  with  any  desired  number  of  values. 

• single-reader or multireader, according to whether the method yields
registers  with  only  one  reader  or  with  any  desired  number  of  readers. 

This  produces  twelve  classes  of  implementations,  partially  ordered  by 
“strength” — for  example,  a  method  that  produces  atomic,  multivalued, 
multireader  registers  is  stronger  than  one  producing  regular,  multivalued, 
single-reader  registers.  In  this  paper,  we  address  the  problem  of  imple¬ 
menting  a  register  of  one  class  using  one  or  more  registers  of  a  weaker 
class. 
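
The count of twelve and the partial order can be checked with a short Python sketch. The tuple encoding and the function names below are assumptions made for illustration; the ordering of each coordinate from weaker to stronger follows the text.

```python
from itertools import product

# Each coordinate is listed from weakest to strongest.
CONSISTENCY = ("safe", "regular", "atomic")
VALUES = ("boolean", "multivalued")
READERS = ("single-reader", "multireader")
AXES = (CONSISTENCY, VALUES, READERS)


def all_classes():
    """Every combination of the three coordinates."""
    return list(product(*AXES))


def at_least_as_strong(a, b):
    """True if class a is at least as strong as class b on every coordinate."""
    return all(axis.index(x) >= axis.index(y) for axis, x, y in zip(AXES, a, b))
```

Two classes are incomparable when each is stronger on a different coordinate, which is why the order is only partial.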

The  weakest  class  of  register,  and  therefore  the  easiest  to  implement,  is  a 
safe,  boolean,  single-reader  one.  This  seems  to  be  the  most  natural  kind  of 
register  to  implement  with  current  hardware  technology,  requiring  only  that 
the  writer  set  a  voltage  level  either  high  or  low  and  that  the  reader  test  this 
level  without  disturbing  it.  A  series  of  constructions  of  stronger  registers 
from  weaker  ones  is  presented  that  allows  almost  every  class  of  register 
to  be  constructed  starting  from  this  weakest  class.  The  one  exception  is 
that  constructing  an  atomic,  multireader  register  from  any  weaker  one  is 
still an open problem. Most of the constructions are simple; the difficult
ones are Construction 4, which implements an m-reader multivalued regular
register using m-reader boolean regular registers, and Construction 5, which
implements a single-reader multivalued atomic register using single-reader
multivalued regular registers.

¹ However, the standard implementation of a multiport memory does not meet
our requirements for an asynchronous register because, if two processes
concurrently access a memory cell, one must wait for the other.

We  have  defined  three  classes  of  shared  registers  for  asynchronous  in¬ 
terprocess  communication,  and  provided  algorithms  for  implementing  one 
class  in  terms  of  a  weaker  class.  For  single-writer  registers,  the  only  un¬ 
solved  problem  is  implementing  a  multi-reader  atomic  register.  A  solution 
probably  exists,  but  it  undoubtedly  requires  that  a  reader  communicate 
with  all  other  readers  as  well  as  with  the  writer.  Also,  more  efficient  im¬ 
plementations  than  Constructions  4  and  5  probably  exist.  For  multivalued 
registers, Peterson’s algorithm [11] combined with Construction 5 provides
a  more  efficient  implementation  of  a  regular  register  than  Construction  4, 
and  a  more  efficient  implementation  of  a  single-reader  atomic  register  than 
Construction  5.  However,  in  this  solution,  Construction  4  is  still  needed  to 
implement  the  regular  register  used  in  Construction  5. 

We  have  not  addressed  the  question  of  multi-writer  shared  registers.  It 
is  not  clear  what  assumptions  one  should  make  about  the  effect  of  over¬ 
lapping  writes.  The  one  case  that  is  straightforward  is  that  of  an  atomic 
multi-writer  register — the  kind  of  register  traditionally  assumed  in  shared- 
variable  concurrent  programs.  This  raises  the  problem  of  implementing 
a  multi-writer  atomic  register  from  single-writer  ones.  An  unpublished 
algorithm  of  Bard  Bloom  implements  a  two-writer  atomic  register  using 
single-writer  atomic  registers. 

In  addition  to  studying  shared  registers,  we  have  also  developed  a  for¬ 
malism  for  reasoning  about  concurrent  systems  that  is  not  based  upon 
atomic  actions.  Starting  from  a  more  general,  relativistic  viewpoint,  we 
showed  that  one  can,  with  no  essential  loss  of  generality,  think  in  terms 
of  starting  and  finishing  times  of  operations.  While  starting  and  finishing 
times  are  intuitively  more  appealing,  and  can  be  useful  in  proving  metathe¬ 
orems  about  general  systems,  rigorous  reasoning  about  specific  algorithms 
is  best  done  in  the  general  formalism,  using  Axioms  A1-A5.  These  axioms 
seem  to  contain  the  fundamental  properties  of  temporal  relations  among 
operation  executions  that  are  needed  to  analyze  concurrent  algorithms. 

1.2 The Constructions


In  this  section,  the  algorithms  for  constructing  different  classes  of  regis¬ 
ters  are  described  and  informally  justified.  Rigorous  correctness  proofs  are 
postponed  until  Section  1.4,  after  the  necessary  formalism  is  developed. 

The  algorithms  are  described  by  indicating  how  a  write  and  a  read  are 
performed.  I  will  not  bother  to  indicate  the  initial  state  of  the  shared 
registers — it  is  the  one  that  would  result  from  writing  the  initial  value 
starting  from  any  arbitrary  state. 

The  first  construction  implements  a  multireader  safe  or  regular  register 
from  single-reader  ones.  It  uses  the  obvious  method  of  having  the  writer 
simply  maintain  a  separate  copy  of  the  register  for  each  reader.  The  for 
all  statement  denotes  that  its  body  is  executed  once  for  each  of  the  indi¬ 
cated  values  of  i;  these  separate  executions  can  be  done  in  any  order  or 
concurrently. 

Construction 1 Let v_1, ..., v_m be single-reader, n-valued registers, where
each v_i can be written by the same writer and read by process i, and
construct a single n-valued register v in which the operation v := μ is
performed as follows:

    for all i in {1, ..., m}
      do v_i := μ od

and process i reads v by reading the value of v_i. If the v_i are safe or
regular registers, then v is a safe or regular register, respectively.

Any read by process i that does not overlap a write of v does not overlap
a write of v_i. If v_i is safe, then this read gets the correct value, which
shows that v is safe. If a read of v_i by process i overlaps a write of v_i,
then it overlaps the write of the same value to v. It follows easily from
this that if v_i is regular, then v is also regular.

This construction does not make v an atomic register even if the v_i are
atomic. If reads by two different processes i and j both overlap the same
write, it is possible for i to get the new value and j the old value even
though the read by i precedes the read by j, a possibility not allowed by
an atomic register.
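
A minimal Python sketch of Construction 1 follows. It is a sequential model rather than a concurrent implementation: the class name, the zero-based reader indices, and the `write_steps` generator (which pauses after each per-reader copy is updated so that reads can be interleaved by hand) are all assumptions of this example. The copies are updated in index order, which is one of the orders the for all statement permits.

```python
class Construction1:
    """Multireader register built from one single-reader copy per reader."""

    def __init__(self, readers, initial):
        self.copies = [initial] * readers  # one copy per reader

    def write_steps(self, value):
        """Update the copies one at a time, pausing after each update."""
        for i in range(len(self.copies)):
            self.copies[i] = value
            yield i

    def write(self, value):
        """Perform a complete write with no interleaved reads."""
        for _ in self.write_steps(value):
            pass

    def read(self, i):
        """Reader i reads only its own copy."""
        return self.copies[i]
```

Pausing a write after the first copy is updated reproduces the non-atomic behavior: reader 0 can obtain the new value and, afterwards, reader 1 can still obtain the old one.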

The next construction is also trivial; it implements an n-bit safe register
from  n  single-bit  ones. 


Construction 2 Let v_1, ..., v_n be boolean m-reader registers, each written
by the same writer and read by the same set of readers. Let v be the
2^n-valued, m-reader register in which the number with binary representation
μ_1 ... μ_n is written by

    for all i in {1, ..., n} do v_i := μ_i od

and in which the value is read by reading all the v_i. If each v_i is safe,
then v is safe.


The register v is not regular even if the v_i are. A read can return any
value if it overlaps a write that changes the register’s value from 0...0 to
1...1.
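
The failure of regularity can be seen by enumeration. The sketch below assumes that each component bit is itself regular, so a bit read that overlaps the write returns either that bit's old value or its new value, independently of the other bits. The function name and the tuple-of-bits encoding are assumptions of this example.

```python
from itertools import product


def possible_reads(old_bits, new_bits):
    """Values a read may assemble while overlapping a write of new_bits."""
    # Each bit independently yields its old or its new value.
    choices = [sorted({old, new}) for old, new in zip(old_bits, new_bits)]
    return {int("".join(map(str, bits)), 2) for bits in product(*choices)}
```

When every bit changes, all 2^n values are reachable, including values that were never written; when only one bit changes, the read can return only the old or the new value.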

The  next  construction  shows  that  it  is  trivial  to  implement  a  boolean 
regular  register  from  a  safe  boolean  register.  In  a  safe  register,  a  read  that 
overlaps  a  write  may  get  any  value,  while  in  a  regular  register  it  must  get 
either  the  old  or  new  value.  However,  a  read  of  a  safe  boolean  register 
must  obtain  either  true  or  false  on  any  read,  so  it  must  return  either  the 
old  or  new  value  if  it  overlaps  a  write  that  changes  the  value.  A  boolean 
safe  register  can  fail  to  be  regular  only  if  a  read  that  overlaps  a  write  that 
does  not  change  the  value  returns  the  other  (wrong)  value.  To  prevent  this 
possibility,  one  simply  does  not  perform  a  write  that  does  not  change  the 
value. 

Construction 3 Let v be an m-reader boolean register, and let x be a
variable internal to the writer (not a shared register) initially equal to the
initial value of v. Define v* to be the m-reader boolean register in which the
write operation v* := μ is performed as follows:

    if x ≠ μ then v := μ;
                  x := μ
    fi

and a read of v* is performed by reading v. If v is safe then v* is regular.
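
A minimal sequential sketch of Construction 3 in Python is shown below. The class name and the `physical_writes` counter are assumptions added for illustration; the counter exists only to make visible that redundant writes never touch the shared register.

```python
class SkipRedundantWrites:
    """Boolean register whose writer never rewrites an unchanged value."""

    def __init__(self, initial):
        self.v = initial          # the shared safe boolean register
        self.x = initial          # writer-local copy of the last value written
        self.physical_writes = 0  # illustrative counter, not part of the algorithm

    def write(self, value):
        # Touch the shared register only when the value actually changes.
        if self.x != value:
            self.v = value
            self.physical_writes += 1
            self.x = value

    def read(self):
        return self.v
```

Because every physical write changes the value, a read that overlaps one can only return the old or the new value, which is the regularity condition for a boolean register.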




There are two known algorithms for implementing a multivalued regular
register from boolean ones. The simpler one employs a unary encoding, in
which the value μ is denoted by zeros in bits 0 through μ − 1 and a one in
bit μ. A reader reads the bits from left to right (0 to n) until it finds a one.
To write the value μ, the writer first sets v_μ to one and then sets bits μ − 1
through 1 to zero, writing from right to left. (The idea of implementing
shared data by reading and writing its components in different directions
was also used in [4].)

Construction 4 Let v_1, ..., v_n be boolean, m-reader registers, and let v be
the n-valued, m-reader register in which the operation v := μ is performed
by

    v_μ := 1;
    for i := μ − 1 step −1 until 1 do v_i := 0 od

and a read is performed by:

    μ := 1;
    while v_μ = 0 do μ := μ + 1 od;
    return μ

If each v_i is regular, then v is regular.

The correctness of this algorithm is not at all obvious. Indeed, it is not
even obvious that the while loop in the read operation does not “fall off
the end” and try to read the nonexistent register v_{n+1}. This can’t happen
because whenever the writer writes a zero, there is a one to the right of
it. (Since I am assuming that an initial value has been written, some v_i
initially equals one.) As an exercise, the reader of this paper can convince
himself that whenever a reading process sees a one, it was written by either
a concurrent write or by the most recent preceding one, so v is regular.
The formal proof is given in Section 1.4.

The value of v_n is only set to one, never to zero. It can therefore be
eliminated;  the  writer  simply  never  writes  it  and  the  reader  assumes  its 
value  is  one  instead  of  reading  it.  I  will  not  bother  writing  down  this 
modification. 
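
The unary construction can be sketched sequentially in Python as follows. The class name is an assumption, index 0 of the list is left unused so that the bits are numbered 1 through n as in Construction 4, and the initial state is produced by writing the initial value into an all-zero array, which is one instance of the initialization convention stated earlier. The sketch keeps the bit for value n rather than applying the elimination just described.

```python
class UnaryRegister:
    """n-valued register encoded in unary over bits 1..n."""

    def __init__(self, n, initial):
        self.n = n
        self.bits = [0] * (n + 1)  # index 0 is unused
        self.write(initial)

    def write(self, value):
        self.bits[value] = 1                # set the bit for the new value first
        for i in range(value - 1, 0, -1):   # then clear lower bits, right to left
            self.bits[i] = 0

    def read(self):
        mu = 1
        while self.bits[mu] == 0:           # scan left to right for the first one
            mu += 1
        return mu
```

Note that a write never clears bits to the right of the one it sets, so stale ones can remain there; the left-to-right scan stops before reaching them.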


Even if all the v_i are atomic, Construction 4 does not produce an atomic
register. To see this, suppose that the register initially has the value 3, so
v_1 = v_2 = 0 and v_3 = 1, the writer first writes the value 1 then the value
2, and there are two successive read operations. This can produce the
following sequence of actions:

• the first read finds v_1 = 0

• the first write sets v_1 := 1

• the second write sets v_2 := 1

• the first read finds v_2 = 1 and returns the value 2

• the second read finds v_1 = 1 and returns the value 1.

In  this  scenario,  the  first  read  obtains  a  newer  value  (the  one  written  by 
the  second  write)  than  the  second  read  (which  obtains  the  one  written  by 
the  first  write),  even  though  it  precedes  the  second  read.  This  shows  that 
the  register  is  not  atomic. 
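
The scenario can be replayed step by step in Python. In this sketch each bit is a plain list element, so individual bit accesses are trivially atomic; the `BitReader` class and the `replay_scenario` function are hypothetical names, and the interleaving is written out by hand rather than produced by real concurrency.

```python
class BitReader:
    """One read operation of the unary register, advanced one bit at a time."""

    def __init__(self, bits):
        self.bits = bits
        self.mu = 1
        self.result = None

    def step(self):
        # Inspect the current bit: finish on a one, otherwise move right.
        if self.bits[self.mu] == 1:
            self.result = self.mu
        else:
            self.mu += 1


def replay_scenario():
    bits = [None, 0, 0, 1]  # bits 1..3 hold 0, 0, 1: the register holds 3
    first = BitReader(bits)
    first.step()            # first read finds bit 1 equal to 0
    bits[1] = 1             # first write (value 1) sets bit 1
    bits[2] = 1             # second write (value 2) sets bit 2
    first.step()            # first read finds bit 2 equal to 1, returns 2
    second = BitReader(bits)
    second.step()           # second read finds bit 1 equal to 1, returns 1
    bits[1] = 0             # second write finishes by clearing bit 1
    return first.result, second.result
```

The function returns the pair (2, 1): the earlier read reports the later write's value, and the later read reports the earlier write's value.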

Construction  4  uses  n  -  1  boolean  regular  registers  to  make  an  n-valued 
one,  so  it  is  practical  only  for  small  values  of  n.  We  would  like  an  algorithm 
that requires O(log n) boolean registers to construct an n-valued register.
The  second  method  for  constructing  a  regular  multivalued  register  uses  an 
algorithm  of  Peterson  [11]  that  implements  an  m-reader  n-valued  atomic 
register with m + 2 safe m-reader registers, 2m atomic boolean 2-reader
registers,  and  two  atomic  boolean  m-reader  registers.  There  is  no  known 
algorithm  for  constructing  multivalued  m-reader  atomic  registers  from  sim¬ 
pler  ones.  However,  we  can  apply  Peterson’s  algorithm  to  construct  an  n- 
valued  single-reader  atomic  register  using  three  safe  single-reader  n-valued 
registers  and  four  single-reader  atomic  boolean  registers.  The  safe  registers 
can  be  implemented  with  Construction  2,  and  the  atomic  boolean  registers 
can  be  implemented  with  Construction  5  below.  Since  an  atomic  register 
is regular, Construction 1 can then be used to make an m-reader n-valued
regular register from O(3m log n) single-reader boolean regular registers.

Before  giving  the  algorithm  for  constructing  a  two-reader  atomic  regis¬ 
ter,  I  prove  a  result  that  indicates  why  no  trivial  algorithm  will  work.  It 
asserts that there can be no algorithm in which the writer only writes and
the  reader  only  reads;  any  algorithm  must  involve  two-way  communication 
between  the  reader  and  the  writer. 

Theorem:  There  exists  no  algorithm  to  implement  an  atomic  register  using 
only  a  finite  number  of  regular  registers  that  can  be  written  by  the  writer 
(of  the  atomic  register). 

Proof  :  I  assume  such  an  algorithm  and  derive  a  contradiction.  Without 
loss  of  generality,  I  can  assume  that  there  is  only  a  single  regular  register  v 
written  by  the  writer  and  read  by  the  reader.  (Any  algorithm  that  works 
with  multiple  registers  must  also  work  when  those  registers  are  combined 
into  a  single  large  regular  register.) 

Let v* denote the atomic register that is being implemented. Suppose
that the writer performs an infinite number of writes that change the value
of v*. There must be some pair of values assumed by v*, call them 0 and 1,
such that there are an infinite number of writes that change v*’s value from
0 to 1. Since v can assume only a finite number of values (the hypothesis
states that the original algorithm has only a finite number of registers, and
all registers are taken to have only a finite number of possible values), there
must exist values v_0, ..., v_n of v such that v_0 is the final value of v after
each one of an infinite number of writes of 0 to v*, v_n is the final value of v
after each one of an infinite number of writes of 1 to v*, and, for each i < n,
the value of v is changed from v_i to v_{i+1} during infinitely many writes that
change the value of v* from 0 to 1.

A  read  of  v*  may  involve  several  reads  of  v.  However,  by  considering 
only  scenarios  in  which  each  of  those  reads  of  v  obtains  the  same  value, 
we may assume that each read of v* reads v only once. Since v assumes
each value v_i infinitely often, it must be possible for a sequence of n + 1
consecutive reads to obtain the values v_n, v_{n-1}, ..., v_0.

The read that finds v equal to v_i and the subsequent read that finds v
equal to v_{i-1} could both have overlapped the same write of v, which could
have been a write that occurred in the process of changing v*’s value from 0
to 1. Therefore, if the read of v* that finds v equal to v_i returns the value
1, then the subsequent read that finds v equal to v_{i-1} must also return the
value 1, since both reads could be overlapping the same write and, in that
case, two successive reads of an atomic register cannot return first the new
then the old value.

The first read, which finds v equal to v_n, must return the value 1, since it
could have occurred after the completion of a write of 1. By induction, this
implies that the last read, which found v equal to v_0, must return the value
1. However, this read could have occurred after a write of 0 and before any
subsequent write, so returning the value 1 would violate the assumption
that the register v* is safe. (An atomic register is a fortiori safe.) This is
the required contradiction. ∎

This  theorem  could  be  expressed  and  proved  using  the  formalism  devel¬ 
oped  below,  but  doing  so  would  lead  to  no  new  insight.  The  formal  proof 
of  this  theorem  is  therefore  left  as  an  exercise  for  the  compulsive  reader. 

The  theorem  is  false  if  no  bound  is  placed  on  the  number  of  values  a 
register  can  hold.  Given  a  regular  register  v  that  can  assume  an  unbounded 
number of values, an atomic register v* is implemented as follows. The
writer sets v equal to a pair consisting of the value of v* and a sequential
version  number.  The  reader  reads  v  and  compares  the  version  number  with 
the  previous  one  it  read.  If  the  new  version  number  is  higher,  then  it  uses 
the  value  it  just  read;  if  the  new  version  number  is  lower,  then  it  forgets 
the  value  and  version  number  it  just  read  and  uses  the  previously-read 
value.  The  correctness  of  this  algorithm  follows  easily  from  Proposition  9 
of  Section  1.3.3.  By  assuming  registers  hold  only  a  bounded  set  of  values, 
I  am  disallowing  such  algorithms. 
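
The version-number scheme just described can be sketched in Python. This is only an illustration under stated assumptions: the names `VersionedReader` and `write_sequence` are hypothetical, and the regular register is not simulated; the test data simply hands the reader the pairs that a regular register might return, including a "new then old" pair of observations.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class VersionedReader:
    """Reader side: remembers the newest (version, value) pair it has used."""
    version: int = -1          # assumption: -1 means "nothing read yet"
    value: Any = None

    def read(self, observed: Tuple[int, Any]) -> Any:
        # `observed` is whatever pair the regular register v returned.
        new_version, new_value = observed
        if new_version > self.version:
            # Higher version number: use the value just read.
            self.version, self.value = new_version, new_value
        # Lower version number: forget the pair just read and
        # reuse the previously read value.
        return self.value


def write_sequence(values: List[Any]) -> List[Tuple[int, Any]]:
    """Writer side: pair each value of v* with a sequential version number."""
    return list(enumerate(values))
```

Because the reader never lets its version number decrease, two successive reads cannot return a new value followed by an older one, which is the property that the bounded construction below has to obtain by other means.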

Finally, we come to the algorithm for constructing a single-reader atomic
register from regular ones. To begin, we try to implement an atomic register
$v^*$ with a regular register $v$ that holds a pair of values, both normally
equal. When $v$ is changed from $(\nu, \nu)$ (denoting $v^* = \nu$) to
$(\mu, \mu)$ (denoting $v^* = \mu$), it is first set to the intermediate value
$(\nu, \mu)$. The reader reads $v$ and returns the first component unless it
obtains $(\nu, \mu)$ after having returned the value $\mu$ the last time, in
which case it must return the value $\mu$ to avoid a “new-old” sequence.

The preceding theorem shows that this idea, by itself, is not enough.
The reader is in a quandary if three successive reads of $v$ obtain the values
$(\mu, \mu)$, $(\nu, \mu)$, and $(\nu, \nu)$. The first read simply returns
$\mu$; as I just observed, the second read must also return $\mu$; but what can
the third read return? The second and third reads could both have overlapped a
single write that is changing the value from $\nu$ to $\mu$, so returning $\nu$
would produce a new-old sequence. On the other hand, the third read could have
seen a completely new value, written long after the write that overlapped the
second read, so returning $\mu$ could violate safety: the requirement that a
read not overlapping any write return the correct value.

To overcome this problem, I add another bit to $v$, which I will call the
color value. When the reader reads $v$, it sets a shared one-bit register $cr$
to $v$'s color value. The writer first reads the register $cr$ and sets $v$ to
the opposite color. (Thus, the reader tries to make $cr$ and $v$'s color the
same, and the writer tries to make them different.) The reader interprets
$(\nu, \mu)$ as a $\mu$ only if its previous read saw a $(\mu, \mu)$ of the same
color. The only source of embarrassment is now if three successive reads return
values $(\mu, \mu)$, $(\nu, \mu)$, and $(\nu, \nu)$ that are all the same
color. It will be shown in Section 4 that this can happen only if the last read
actually overlaps the write of $(\nu, \mu)$, so it is allowed to return the
value $\mu$ without violating the safety requirement.

In the following construction, the variable $cr$ is written by the reader
and read by both the reader and the writer. A two-reader register is not
needed, since the reader can maintain a local variable containing the value
that it last wrote into $cr$. (This is just Construction 1 with $m = 2$ and the
writer being the second reader.) Such a local variable would complicate the
description, so it is omitted. In the reader’s program, the primed variables
denote the values read the previous time, except that if the reader reads
$(\mu, \mu)$ then $(\nu, \mu)$, both with the same color, then it “forgets
about” the latter value.

Construction 5 Let $V$ be an $n$-element set; let $w$ and $r$ be processes; let
$v, cw$ denote a single $2n^2$-valued register that can be written by $w$ and
read by $r$, where $v$ has a value in $V \times V$ and $cw$ is boolean valued;
and let $cr$ be a boolean register that can be written by $r$ and read by $w$.
Define the $n$-valued register $v^*$, with values in $V$, written by $w$ and
read by $r$ by letting the write $v^* := \mu$ be performed by:

    v, cw := (v_1, μ), ¬cr;

    v, cw := (μ, μ), cw

and letting the read operation be performed by the program of Figure 1.1,
where $x$ and $x'$ are local variables in $V \times V$, $cr'$ is a
boolean-valued local variable, and $rtn$ is a local variable with values in $V$
whose final value is the one returned by the read. Initially, $x'$, $cr'$
equals (w,


    x, cr := v, cw;
    if cr = cr'
      then if x_1 = x_2
             then if x_1 = x'_1 ≠ x'_2 ∧ rtn = x'_2
                    then skip
                    else x' := x;
                         rtn := x_1
                  fi
             else if (x = x' ∧ rtn = x_2) ∨ x'_1 = x'_2 = x_2
                    then x' := x;
                         rtn := x_2
                    else x' := x;
                         rtn := x_1
                  fi
           fi
      else x', cr' := x, cr;
           rtn := x_1
    fi


Figure 1.1: Construction 5: the reader’s algorithm.
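
The reader's program of Figure 1.1 can be transcribed into Python to trace how it handles particular sequences of observations. The sketch below is an illustration, not the construction itself: the class name `Construction5Reader` is hypothetical, the shared register $cr$ is reduced to a local variable, and the initial values of $x'$, $cr'$ and $rtn$ are an assumption, since they are not legible in the text above.

```python
from typing import Hashable, Tuple

Pair = Tuple[Hashable, Hashable]


class Construction5Reader:
    """Local state of the reader r: x', cr' and rtn (names follow Figure 1.1)."""

    def __init__(self, initial: Hashable, initial_color: bool = False):
        # ASSUMPTION: initial state chosen for illustration only.
        self.x_prev: Pair = (initial, initial)   # x'
        self.cr_prev: bool = initial_color       # cr'
        self.rtn: Hashable = initial             # rtn

    def read(self, x: Pair, cw: bool) -> Hashable:
        """Process one observation (x, cw) of the register v, cw."""
        cr = cw  # "x, cr := v, cw": the reader copies the color into cr
        if cr == self.cr_prev:
            if x[0] == x[1]:
                if x[0] == self.x_prev[0] != self.x_prev[1] and self.rtn == self.x_prev[1]:
                    pass  # skip: keep x' and keep returning the same value
                else:
                    self.x_prev = x
                    self.rtn = x[0]
            else:
                if (x == self.x_prev and self.rtn == x[1]) or (
                    self.x_prev[0] == self.x_prev[1] == x[1]
                ):
                    self.x_prev = x
                    self.rtn = x[1]
                else:
                    self.x_prev = x
                    self.rtn = x[0]
        else:
            self.x_prev, self.cr_prev = x, cr
            self.rtn = x[0]
        return self.rtn
```

Feeding this reader the three same-colored observations $(\mu, \mu)$, $(\nu, \mu)$, $(\nu, \nu)$ makes it return $\mu$ each time, so no new-old sequence is produced; an observation with a different color simply returns its first component.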
1.3 The Formal Model
1.3.1  System  Executions 

Almost  all  models  of  concurrent  processes  are  based  upon  indivisible  atomic 
actions  as  their  primitive  elements.  For  example,  models  in  which  a  process 
is  represented  by  a  sequence  or  “trace”  [1,12,13]  assume  that  each  element 
in the sequence represents an indivisible action. Net models [2] and related
formalisms [9,10] assume that the firing of an individual transition
is  atomic.  Operations  to  a  nonatomic  shared  register  cannot  be  modeled 
as  atomic  actions,  since  these  formalisms  have  no  concept  of  two  atomic 
actions  overlapping  in  time. 

One  can  model  a  single  read  or  write  operation  with  two  atomic  actions: 
a  start  and  a  finish  action.  I  will  employ  such  a  model  to  motivate  the 
formalism.  However,  in  the  general  view  of  physical  systems  based  upon 
special  relativity  that  is  discussed  in  [7]  and  [5],  there  may  be  no  single  real 
event  that  precedes  all  other  events  in  the  operation,  and  no  single  event 
that  follows  all  others.  I  will  show  that  assuming  such  fictitious  start  and 
finish  events  would  result  in  no  loss  of  generality.  However,  it  turns  out  to 
be  easier  to  reason  directly  in  terms  of  the  nonatomic  actions  than  to  use 
starting  and  finishing  events. 

I therefore eschew more conventional formalisms in favor of one introduced
in [6] and refined in [5], in which the primitive elements are operation
executions that are not assumed to be atomic. In this formalism, an execution
of a system is represented as a triple $\mathcal{S}, \rightarrow,
\dashrightarrow$, where $\mathcal{S}$ is a finite or countably infinite set of
operation executions, and $\rightarrow$ and $\dashrightarrow$ are precedence
relations on $\mathcal{S}$.

The most general way of viewing the formalism is to consider an operation
execution to be a set of points in four-dimensional space-time. Such
a view is provided in [5]. While using the same formalism as [5], I will
employ a less general but more intuitive model. In this model, an operation
execution $A$ is thought of as an activity performed during some time
interval $[s_A, f_A]$, where the real numbers $s_A$ and $f_A$ are the starting
and finishing times of $A$. I assume that at any time, only a finite number of
operation executions have begun. Stated formally, a model consists of a
set $\mathcal{S}$ of operation executions, together with real-valued functions
$s$ and $f$ on $\mathcal{S}$ such that the following conditions hold for all
$A$ and $B$ in $\mathcal{S}$ (where I write $s_A$ and $f_A$ instead of $s(A)$
and $f(A)$):

M1. $s_A \le f_A$

M2. For any real number $t$: $\{A : s_A < t\}$ is finite

An operation execution $A$ is said to be instantaneous if, for any $B \ne A$,
the numbers $s_B$ and $f_B$ lie outside the interval $[s_A, f_A]$. Thus, $A$ is
instantaneous if and only if we can set $s_A$ equal to $f_A$ (shrinking the
interval to a point) without changing the relative order of any starting and
finishing times. Given such a model, we can define the relations $\rightarrow$
and $\dashrightarrow$ as follows:

    $A \rightarrow B \equiv f_A < s_B$

    $A \dashrightarrow B \equiv s_A \le f_B$        (1)

Thus, $A \rightarrow B$ means that $A$ finishes before $B$ starts, and
$A \dashrightarrow B$ means that $A$ starts no later than $B$ finishes. We read
$A \rightarrow B$ as “$A$ precedes $B$” and $A \dashrightarrow B$ as “$A$ can
affect $B$”.

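M1, M2 and (1) imply that the following hold for all operation executions
$A$, $B$, $C$, and $D$ in $\mathcal{S}$:

A1. The relation $\rightarrow$ is an irreflexive partial ordering.

A2. If $A \rightarrow B$ then $A \dashrightarrow B$ and
    $B \not\dashrightarrow A$.

A3. If $A \rightarrow B \dashrightarrow C$ or $A \dashrightarrow B \rightarrow C$
    then $A \dashrightarrow C$.

A4. If $A \rightarrow B \dashrightarrow C \rightarrow D$ then $A \rightarrow D$.

A5. For any $A$, the set of all $B$ such that $A \nrightarrow B$ is finite.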

Instead  of  basing  the  formalism  on  this  model,  I  adopt  the  more  general 
view  of  [5]  and  take  A1-A5  as  axioms. 
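
To make the step from the interval model to the axioms concrete, the following Python sketch defines the two relations of (1) on intervals and checks A1-A4 by brute force on a finite set of operation executions. The function names are assumptions chosen for this illustration, and A5 is not checked because it holds trivially for any finite set.

```python
from itertools import product
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (s_A, f_A), assumed to satisfy s_A <= f_A


def precedes(a: Interval, b: Interval) -> bool:
    """A -> B: A finishes before B starts."""
    return a[1] < b[0]


def can_affect(a: Interval, b: Interval) -> bool:
    """A --> B: A starts no later than B finishes."""
    return a[0] <= b[1]


def check_axioms(ops: Dict[str, Interval]) -> bool:
    """Brute-force check of A1-A4 for the relations defined by (1)."""
    iv = list(ops.values())
    for a in iv:
        if precedes(a, a):                       # A1: irreflexive
            return False
    for a, b in product(iv, repeat=2):
        if precedes(a, b) and not (can_affect(a, b) and not can_affect(b, a)):
            return False                         # A2
    for a, b, c in product(iv, repeat=3):
        if precedes(a, b) and precedes(b, c) and not precedes(a, c):
            return False                         # A1: transitive
        if precedes(a, b) and can_affect(b, c) and not can_affect(a, c):
            return False                         # A3, first form
        if can_affect(a, b) and precedes(b, c) and not can_affect(a, c):
            return False                         # A3, second form
    for a, b, c, d in product(iv, repeat=4):
        if (precedes(a, b) and can_affect(b, c) and precedes(c, d)
                and not precedes(a, d)):
            return False                         # A4
    return True
```

For example, with `A = (0, 2)` and `B = (1, 3)` neither operation precedes the other, yet each can affect the other, which is how overlapping (concurrent) operation executions appear in this model.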

Definition 1 A system execution is a triple $\mathcal{S}, \rightarrow,
\dashrightarrow$ such that $\mathcal{S}$ is a finite or countably infinite set
and $\rightarrow$ and $\dashrightarrow$ are relations on $\mathcal{S}$ that
satisfy A1-A5.

Observe that A1 and A4 imply that if $A \rightarrow B$ and
$A \dashrightarrow B$ then $B \not\dashrightarrow A$, so the “and
$B \not\dashrightarrow A$” in A2 is superfluous.

Definition 1 differs from the definition of a system execution given in [5]
because I am considering only terminating operations. In the more general
formalism, Axiom A5 needs the hypothesis that $A$ terminates.

Definition 2 A global-time model of a system execution $\mathcal{S},
\rightarrow, \dashrightarrow$ consists of a pair $s, f$ of real-valued functions
on $\mathcal{S}$ satisfying M1, M2 and (1). It is said to be nondegenerate if,
for all $A$: $s_A < f_A$ and for all $B \ne A$: $s_A \ne s_B$ and
$s_A \ne f_B$.

A  nondegenerate  global-time  model  is  one  in  which  no  two  starting  or 
stopping times are identical. The following result states that any global-time
model can be turned into a nondegenerate one by tiny perturbations of
the  starting  and  finishing  times  of  operation  executions.  Such  perturbations 
should  be  allowed,  since  no  physically  meaningful  result  could  depend  upon 
completely  accurate  knowledge  of  these  times.  (It  makes  no  physical  sense 
to  specify  the  starting  and  finishing  times  of  an  operation  execution  down 
to  the  fraction  of  a  micropicosecond.) 

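Proposition 1 For any global-time model $s, f$ of a system execution
$\mathcal{S}, \rightarrow, \dashrightarrow$ and any $\epsilon > 0$, there exists
a nondegenerate global-time model $s', f'$ of $\mathcal{S}, \rightarrow,
\dashrightarrow$ such that $|s_A - s'_A| < \epsilon$ and
$|f_A - f'_A| < \epsilon$ for all $A \in \mathcal{S}$.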

The  proofs  of  this  and  all  other  propositions  stated  in  this  section  are 
given  in  the  appendix. 

In a global-time model, the starting and finishing times of operations
are totally ordered. Given two operation executions $A$ and $B$, $s_B$ must be
either greater than or not greater than $f_A$, so the following condition holds.

A#. For any operation executions $A$ and $B$ with $A \ne B$:
    $A \rightarrow B$ or $B \dashrightarrow A$.

This condition does not hold for all system executions. (Trivial
counterexamples are obtained by noting that the empty precedence relations make
any set a system execution.) Condition A# holds only if there is a global-time
model.

Proposition 2 A system execution $\mathcal{S}, \rightarrow, \dashrightarrow$
has a global-time model if and only if A# holds.

In the more general interpretation of operation executions given in [5],
condition A# fails to hold for a pair of operation executions $A$, $B$ if $A$
and $B$ occur at spatially separated locations, and they both happen within a
time interval that is less than the time needed for light to travel between
their locations. In most systems of practical interest, A# holds for almost
all pairs $A$, $B$ of operation executions.

The  following  result  shows  that  we  can  get  a  global-time  model  by 
adding  extra  precedence  relations. 

Proposition 3 Given any system execution $\mathcal{S}, \rightarrow,
\dashrightarrow$, there exist extensions $\rightarrow'$ of $\rightarrow$ and
$\dashrightarrow'$ of $\dashrightarrow$ such that $\mathcal{S}, \rightarrow',
\dashrightarrow'$ is a system execution satisfying A#.

Later, I will indicate why we can consider the system execution
$\mathcal{S}, \rightarrow', \dashrightarrow'$ to be a reasonable way of viewing
the system execution $\mathcal{S}, \rightarrow, \dashrightarrow$.

A system execution satisfying A# is maximal in the sense that no additional
$\rightarrow$ or $\dashrightarrow$ relations can be added. This is because, for
any pair of distinct operation executions $A$ and $B$, A# implies that either
$A \rightarrow B$, or $B \rightarrow A$, or $A \dashrightarrow B$ and
$B \dashrightarrow A$. In any of these three cases, adding an additional
precedence relation would violate A1 or A2.

When  trying  to  understand  an  algorithm  or  its  correctness  proof,  it 
is  useful  to  think  in  terms  of  a  global-time  model,  drawing  pictures  of 
reads  and  writes  as  time  intervals.  However,  I  find  that  the  best  way  to 
formalize the proof is to use Axioms A1-A5. The additional assumption
A#,  implicitly  introduced  when  using  a  global-time  model,  is  not  needed. 

1.3.2 Hierarchical Views

The same system can be viewed at different levels of detail, with different
operation executions at each level. Viewed at the customer’s level,
a  banking  system  has  operation  executions  such  as  deposit  $10.  Viewed 
at  the  programmer’s  level,  this  same  system  executes  operations  such  as 
dep .amt[cust]  :=  1000.  The  fundamental  problem  of  system  building  is 
to implement one system (like a banking system) as a higher-level view of
another  system  (like  a  Pascal  program). 

A higher-level operation consists of a set of lower-level operations: the
set of operations that implement it. Let $\mathcal{S}, \rightarrow,
\dashrightarrow$ be a system execution and let $\mathcal{H}$ be a set whose
elements, called higher-level operation executions, are sets of operation
executions from $\mathcal{S}$. We consider the starting time of a higher-level
operation execution $H$ to be the earliest starting time of all the operation
executions it contains, and its finishing time to be their latest finishing
time. In other words, for every $H$ in $\mathcal{H}$:

    $s^*_H = \min\{s_A : A \in H\}$

    $f^*_H = \max\{f_A : A \in H\}$        (2)

In order for this to define real-valued functions $s^*$ and $f^*$ on
$\mathcal{H}$ that satisfy M1 and M2, it is sufficient for $\mathcal{H}$ to
satisfy the following two conditions:

H1. Each element of $\mathcal{H}$ is a finite, nonempty set of elements of
    $\mathcal{S}$.

H2. Each element of $\mathcal{S}$ belongs to a finite, nonzero number of
    elements of $\mathcal{H}$.

A set $\mathcal{H}$ of subsets of $\mathcal{S}$ satisfying H1 and H2 is called
a higher-level view of $\mathcal{S}$. In most cases of interest, $\mathcal{H}$
is a partition of $\mathcal{S}$, so each element of $\mathcal{S}$ belongs to
exactly one element of $\mathcal{H}$. However, I allow the more general
case in which a single lower-level operation execution is viewed as part of
the implementation of more than one higher-level one.

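Let $\mathcal{S}, \rightarrow, \dashrightarrow$ be a system execution with a
global-time model $s, f$, and let $\mathcal{H}$ be a higher-level view of
$\mathcal{S}$. We can define $s^*$ and $f^*$ by (2) and then use (1) to define
$\stackrel{*}{\rightarrow}$ and $\stackrel{*}{\dashrightarrow}$, obtaining a
system execution $\mathcal{H}, \stackrel{*}{\rightarrow},
\stackrel{*}{\dashrightarrow}$ having $s^*, f^*$ as a global-time model. The
precedence relations $\stackrel{*}{\rightarrow}$ and
$\stackrel{*}{\dashrightarrow}$ can be obtained directly from $\rightarrow$ and
$\dashrightarrow$ as follows:

    $G \stackrel{*}{\rightarrow} H \equiv \forall A \in G : \forall B \in H : A \rightarrow B$

    $G \stackrel{*}{\dashrightarrow} H \equiv \exists A \in G : \exists B \in H : A \dashrightarrow B$ or $A = B$        (3)

We can forget about the global-time models and take (3) to be the definitions
of $\stackrel{*}{\rightarrow}$ and $\stackrel{*}{\dashrightarrow}$. It is easy
to show that if $\mathcal{H}$ satisfies H1 and H2, and $\rightarrow$ and
$\dashrightarrow$ satisfy A1-A5, then $\stackrel{*}{\rightarrow}$ and
$\stackrel{*}{\dashrightarrow}$ also satisfy A1-A5. Therefore, if $\mathcal{H}$
is a higher-level view of $\mathcal{S}$, then $\mathcal{H},
\stackrel{*}{\rightarrow}, \stackrel{*}{\dashrightarrow}$ is a system execution.

If the relations $\rightarrow$ and $\dashrightarrow$ also satisfy A#, then so
do $\stackrel{*}{\rightarrow}$ and $\stackrel{*}{\dashrightarrow}$.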
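
The relationship between (2) and (3) can be checked on a small example. The Python sketch below is an illustration under assumptions: operation executions are named by strings and given as intervals of a global-time model, and the function names are hypothetical. It lifts intervals to a higher-level view by (2) and computes the induced relations by (3), so that the two routes can be compared.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Interval = Tuple[float, float]
Group = FrozenSet[str]


def lift(ops: Dict[str, Interval], group: Group) -> Interval:
    """Equation (2): earliest start and latest finish of the members."""
    return (min(ops[a][0] for a in group), max(ops[a][1] for a in group))


def is_view(ops: Dict[str, Interval], view: Iterable[Group]) -> bool:
    """H1 and H2 for finite inputs: nonempty groups covering every operation."""
    view = list(view)
    h1 = all(len(g) > 0 and g <= set(ops) for g in view)
    h2 = all(any(a in g for g in view) for a in ops)
    return h1 and h2


def induced_precedes(ops: Dict[str, Interval], g: Group, h: Group) -> bool:
    """Equation (3): every A in G precedes every B in H."""
    return all(ops[a][1] < ops[b][0] for a in g for b in h)


def induced_can_affect(ops: Dict[str, Interval], g: Group, h: Group) -> bool:
    """Equation (3): some A in G can affect, or equals, some B in H."""
    return any(a == b or ops[a][0] <= ops[b][1] for a in g for b in h)
```

With `ops = {"A": (0, 1), "B": (4, 5), "C": (2, 3)}`, the groups `{"A", "B"}` and `{"C"}` each can affect the other under (3) and neither precedes the other, which matches what (1) gives for the lifted intervals `(0, 5)` and `(2, 3)`.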

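Let us now consider what it means for one system to implement another. If the
system execution $\mathcal{S}, \rightarrow, \dashrightarrow$ is an
implementation of a system execution $\mathcal{H},
\stackrel{\mathcal{H}}{\rightarrow}, \stackrel{\mathcal{H}}{\dashrightarrow}$,
then we expect $\mathcal{H}$ to be a higher-level view of $\mathcal{S}$; that
is, each operation in $\mathcal{H}$ should consist of a set of operation
executions of $\mathcal{S}$ satisfying H1 and H2. This describes the elements
of $\mathcal{H}$, but not the precedence relations
$\stackrel{\mathcal{H}}{\rightarrow}$ and
$\stackrel{\mathcal{H}}{\dashrightarrow}$. What should those relations be?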

If we consider the system execution $\mathcal{S}$ to be the “real” one and
$\mathcal{H}$ to be a fictitious grouping of the real operation executions into
abstract, higher-level ones, then the induced relations
$\stackrel{*}{\rightarrow}$ and $\stackrel{*}{\dashrightarrow}$ are the “real”
precedence relations on $\mathcal{H}$. These induced relations make the
higher-level view $\mathcal{H}$ a system execution, so they are an obvious
choice for the relations $\stackrel{\mathcal{H}}{\rightarrow}$ and
$\stackrel{\mathcal{H}}{\dashrightarrow}$. However, they may not be the proper
choice. Suppose that we are trying to implement an atomic register using
several simpler ones, and consider a read $R$ and write $W$ to that register;
that is, $R$ and $W$ are operation executions in $\mathcal{H}$ that represent a
read and write to the register. Atomicity means that either
$R \stackrel{\mathcal{H}}{\rightarrow} W$ or
$W \stackrel{\mathcal{H}}{\rightarrow} R$. However, the two operation
executions could really be concurrent. For example, there could be some
operation executions $A$ and $B$ in the implementation of $R$ and an operation
execution $C$ in the implementation of $W$ with
$A \dashrightarrow C \rightarrow B$, which (by (3)) implies
$R \stackrel{*}{\dashrightarrow} W$ and $W \stackrel{*}{\dashrightarrow} R$.
Thus, (by A2) the induced relations $\stackrel{*}{\rightarrow}$ and
$\stackrel{*}{\dashrightarrow}$ cannot be the desired relations
$\stackrel{\mathcal{H}}{\rightarrow}$ and
$\stackrel{\mathcal{H}}{\dashrightarrow}$.

When implementing an atomic register from nonatomic ones, in addition
to specifying what set of lower-level operation executions corresponds to an
atomic read or write, one must also specify how to determine whether a
read, which may really be concurrent with a write (according to the induced
relations $\stackrel{*}{\rightarrow}$ and $\stackrel{*}{\dashrightarrow}$), is
considered to precede or follow that write. This must be specified in such a
way that the register satisfies the condition of atomicity, namely, that each
read obtains the value written by the most recent write. Subject to that
requirement, there is a great deal of freedom in specifying the high-level
relation $\stackrel{\mathcal{H}}{\rightarrow}$.

The implementor cannot be completely free to specify the precedence
relations in the high-level system any way he wishes. For example, if there
is at least one write of every possible value of the register, then any
system execution can be viewed as the implementation of an atomic register
by choosing the $\stackrel{\mathcal{H}}{\rightarrow}$ relation to be a
sequential ordering of the reads and writes in which every read comes between
any write of the value it read and the next write operation. This could lead to
a precedence relation in which an operation is defined to precede one that
really occurred several months earlier. Such a precedence relation obviously
seems absurd, but why? In a real system, these reads and writes occur deep
within the computer; we never actually see them happen. What is wrong with
defining the precedence relation $\stackrel{\mathcal{H}}{\rightarrow}$ to
pretend that these operation executions happened in any order we wish? After
all, we are already pretending, contrary to fact, that the operations are not
concurrent.

In addition to reads and writes to registers, real systems perform
externally observable operation executions such as printing on terminals. By
observing these operation executions, we can infer some precedence relations
among the internal reads and writes. We need some condition on
$\stackrel{\mathcal{H}}{\rightarrow}$ and
$\stackrel{\mathcal{H}}{\dashrightarrow}$ to rule out precedence relations that
contradict such observations.

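These contradictions are avoided by requiring that the interval in which
we pretend an operation execution occurs (in forming the
$\stackrel{\mathcal{H}}{\rightarrow}$ and
$\stackrel{\mathcal{H}}{\dashrightarrow}$ relations) be contained within the
interval in which it actually occurred. In other words, we require that a
global-time model $s^{\mathcal{H}}, f^{\mathcal{H}}$ for $\mathcal{H},
\stackrel{\mathcal{H}}{\rightarrow}, \stackrel{\mathcal{H}}{\dashrightarrow}$
satisfy

    $s^*_H \le s^{\mathcal{H}}_H \le f^{\mathcal{H}}_H \le f^*_H$        (4)

where $s^*$ and $f^*$ are defined by (2). To reformulate (4) directly in terms
of the precedence relations, I appeal to the following result.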


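Proposition 4 Let $s, f$ be a nondegenerate global-time model for a system
execution $\mathcal{S}, \rightarrow, \dashrightarrow$ and let $\mathcal{S},
\rightarrow', \dashrightarrow'$ be a system execution satisfying A# such that
for any $A, B \in \mathcal{S}$: $A \rightarrow B$ implies $A \rightarrow' B$.
Then there exists a nondegenerate global-time model $s', f'$ for $\mathcal{S},
\rightarrow', \dashrightarrow'$ such that for all $A \in \mathcal{S}$:

    $s_A \le s'_A \le f'_A \le f_A$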


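This result implies that, if the system executions $\mathcal{S}, \rightarrow,
\dashrightarrow$ and $\mathcal{H}, \stackrel{\mathcal{H}}{\rightarrow},
\stackrel{\mathcal{H}}{\dashrightarrow}$ both satisfy A#, then the ability to
choose $s^{\mathcal{H}}$ and $f^{\mathcal{H}}$ satisfying (4) is equivalent to
the following condition:

H3. For any $G, H \in \mathcal{H}$: if $G \stackrel{*}{\rightarrow} H$ then
    $G \stackrel{\mathcal{H}}{\rightarrow} H$, where
    $\stackrel{*}{\rightarrow}$ is defined by (3).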

This  should  serve  to  motivate  the  following  formal  definition,  which  does 
not  mention  global-time  models. 

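Definition 3 A system execution $\mathcal{S}, \rightarrow, \dashrightarrow$
implements a system execution $\mathcal{H},
\stackrel{\mathcal{H}}{\rightarrow}, \stackrel{\mathcal{H}}{\dashrightarrow}$
if H1-H3 are satisfied.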
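
Condition H3 is easy to test mechanically for finite examples. The sketch below is a minimal illustration, assuming that the lower-level relation $\rightarrow$ and the chosen higher-level relation are given explicitly as sets of ordered pairs; the function name `satisfies_h3` is hypothetical, and the function checks H3 only (not H1, H2, or the axioms A1-A5 for the chosen relation).

```python
from typing import FrozenSet, Iterable, Set, Tuple

Op = str
Group = FrozenSet[Op]


def satisfies_h3(
    view: Iterable[Group],
    low_precedes: Set[Tuple[Op, Op]],
    high_precedes: Set[Tuple[Group, Group]],
) -> bool:
    """H3: whenever G ->* H under (3), the chosen relation must order G before H."""
    view = list(view)
    for g in view:
        for h in view:
            # Induced relation of (3): every A in G precedes every B in H.
            induced = all((a, b) in low_precedes for a in g for b in h)
            if induced and (g, h) not in high_precedes:
                return False
    return True
```

Take a read implemented by `{"A", "B"}` and a write implemented by `{"C"}` with `A -> C -> B`: the induced relation orders neither before the other, so H3 leaves the implementor free to place the read before or after the write. If a later operation `{"D"}` really follows both, H3 rejects any choice that fails to order it after them.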

To relate this definition to the preceding discussion of observable operation
executions, we need the following result. Its statement relies upon
the obvious fact that if $\mathcal{S}, \rightarrow, \dashrightarrow$ is a
system execution, then $\mathcal{T}, \rightarrow, \dashrightarrow$ is also a
system execution for any subset $\mathcal{T}$ of $\mathcal{S}$. (The symbols
$\rightarrow$ and $\dashrightarrow$ denote both the relations on $\mathcal{S}$
and their restrictions to $\mathcal{T}$. Also, in the proposition, the set
$\mathcal{T}$ is identified with the set of all singleton sets $\{A\}$ for
$A \in \mathcal{T}$.)

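Proposition 5 Let $\mathcal{S} \cup \mathcal{T}, \rightarrow, \dashrightarrow$
be a system execution, where $\mathcal{S}$ and $\mathcal{T}$ are disjoint; let
$\mathcal{S}, \rightarrow, \dashrightarrow$ be an implementation of a system
execution $\mathcal{H}, \stackrel{\mathcal{H}}{\rightarrow},
\stackrel{\mathcal{H}}{\dashrightarrow}$; and let $\stackrel{*}{\rightarrow}$
and $\stackrel{*}{\dashrightarrow}$ be the relations defined on
$\mathcal{H} \cup \mathcal{T}$ by (3). Then there exist precedence relations
$\stackrel{\mathcal{H} \cup \mathcal{T}}{\rightarrow}$ and
$\stackrel{\mathcal{H} \cup \mathcal{T}}{\dashrightarrow}$ such that:

•  $\mathcal{H} \cup \mathcal{T},
   \stackrel{\mathcal{H} \cup \mathcal{T}}{\rightarrow},
   \stackrel{\mathcal{H} \cup \mathcal{T}}{\dashrightarrow}$ is a system
   execution that is implemented by $\mathcal{S} \cup \mathcal{T},
   \rightarrow, \dashrightarrow$.

•  The restrictions of $\stackrel{\mathcal{H} \cup \mathcal{T}}{\rightarrow}$
   and $\stackrel{\mathcal{H} \cup \mathcal{T}}{\dashrightarrow}$ to
   $\mathcal{H}$ equal $\stackrel{\mathcal{H}}{\rightarrow}$ and
   $\stackrel{\mathcal{H}}{\dashrightarrow}$, respectively.

•  The restrictions of $\stackrel{\mathcal{H} \cup \mathcal{T}}{\rightarrow}$
   and $\stackrel{\mathcal{H} \cup \mathcal{T}}{\dashrightarrow}$ to
   $\mathcal{T}$ are extensions of the relations $\rightarrow$ and
   $\dashrightarrow$, respectively.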

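To apply this proposition to our discussion of implementations, let
$\mathcal{S}, \rightarrow, \dashrightarrow$ be an execution of a lower-level
system of register reads and writes implementing a higher-level system
execution $\mathcal{H}, \stackrel{\mathcal{H}}{\rightarrow},
\stackrel{\mathcal{H}}{\dashrightarrow}$ of reads and writes. Let $\mathcal{T}$
be the set of all other operation executions in the system, including the
observable ones. Proposition 5 means that, while the precedence relations
$\stackrel{\mathcal{H}}{\rightarrow}$ and
$\stackrel{\mathcal{H}}{\dashrightarrow}$ may imply new precedence relations on
the operation executions in $\mathcal{T}$, these relations
($\stackrel{\mathcal{H} \cup \mathcal{T}}{\rightarrow}$ and
$\stackrel{\mathcal{H} \cup \mathcal{T}}{\dashrightarrow}$) are consistent
with the “real” precedence relations $\rightarrow$ and $\dashrightarrow$ on
$\mathcal{T}$.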

Note that when there are global-time models for all the system executions,
the $*$ relations are the same as the original precedence relations on
the set $\mathcal{T}$, and Proposition 4 implies that the
$\mathcal{H} \cup \mathcal{T}$ relations can be chosen also to be the same as
the original precedence relations on $\mathcal{T}$. However, in general, the
relation $\stackrel{\mathcal{H}}{\rightarrow}$ may contain orderings that imply
additional orderings on the elements of $\mathcal{T}$ beyond those contained in
$\rightarrow$. As a simple example, let $A, B \in \mathcal{S}$, let
$S, T \in \mathcal{T}$, let $S \rightarrow A$, $B \rightarrow T$ be the only
precedence relations among these elements, and let
$\mathcal{H} = \mathcal{S}$. If $A \stackrel{\mathcal{H}}{\rightarrow} B$, then
A1 implies $S \stackrel{\mathcal{H} \cup \mathcal{T}}{\rightarrow} T$ even
though $S \nrightarrow T$.

When implementing a register, I will ignore any operation executions
not involved in the implementation, and consider the system execution
comprised only of the reads and writes that implement the register.
Proposition 5 shows that the implementation cannot lead to any anomalous
precedence relations among the operation executions that are being ignored.

An implementation $\mathcal{S}, \rightarrow, \dashrightarrow$ of $\mathcal{H},
\stackrel{\mathcal{H}}{\rightarrow}, \stackrel{\mathcal{H}}{\dashrightarrow}$
is said to be trivial if every element of $\mathcal{H}$ is a singleton set. In
other words, a trivial implementation is one in which each higher-level
operation execution is implemented by a single lower-level one. In a trivial
implementation, the sets $\mathcal{S}$ and $\mathcal{H}$ are (essentially) the
same; the two system executions differ only in their precedence relations.

Proposition 3 implies that any system execution trivially implements
one that satisfies A#, which, by Proposition 2, has a global-time model.
Implementation is transitive: if $\mathcal{S}, \rightarrow, \dashrightarrow$
implements $\mathcal{S}', \rightarrow', \dashrightarrow'$, which in turn
implements $\mathcal{H}, \stackrel{\mathcal{H}}{\rightarrow},
\stackrel{\mathcal{H}}{\dashrightarrow}$, then $\mathcal{S}, \rightarrow,
\dashrightarrow$ implements $\mathcal{H}, \stackrel{\mathcal{H}}{\rightarrow},
\stackrel{\mathcal{H}}{\dashrightarrow}$. When implementing a higher-level
system, we can therefore assume the lower-level system execution has a
global-time model. However, there is no reason to do so; a rigorous correctness
proof using Axioms A1-A5 will be at least as simple as one based upon starting
and finishing times, and will be more reliable than an intuitive one based upon
pictures of intervals.




1.3.3  Register  Axioms 

The  foregoing  discussion  applies  to  any  system  execution.  I  now  consider 
system  executions  containing  reads  and  writes  to  registers.  In  addition 
to  A1-A5,  some  axioms  special  to  these  kinds  of  operation  executions  are 
needed,  including  axioms  that  provide  the  formal  definitions  of  safe,  regular, 
and  atomic  registers. 

Axioms  A1-A5  do  not  require  that  there  be  any  precedence  relations 
among  operation  executions.  However,  some  precedence  relation  between 
a  read  and  a  write  to  the  same  register  must  be  assumed.  (Communication 
requires  a  causal  connection  between  reads  and  writes.)  The  following 
axiom is assumed; the reader is referred to [5] (where it is labeled C3) for
its  justification.  Note  that  it  is  implied  by  A#. 

B1. For any read $R$ and write $W$ to the same register, $R \dashrightarrow W$
    or $W \dashrightarrow R$ (or both).

Each register is assumed to have a finite set of possible values; for
example, a boolean-valued register has the possible values true and false.
I assume that any read, whether or not it overlaps a write, obtains one of
these values.

B2.  A  read  of  a  register  obtains  one  of  the  values  that  may  be  written  in 
the  register. 

Thus, a read of a Boolean register cannot obtain a nonsense value like
“trlse”. This axiom does not assume that the value obtained by a read was
ever actually written in the register.

I assume that a register $v$ is written by only a single writer, and that
each write precedes the next. Let $V^{[1]}, V^{[2]}, \ldots$ denote the
sequence of write operations to the register $v$, where

    $V^{[1]} \rightarrow V^{[2]} \rightarrow V^{[3]} \rightarrow \cdots$

and let $v^{[i]}$ denote the value written by $V^{[i]}$. (There may be a finite
or infinite number of write operations $V^{[i]}$.)

A register $v$ is assumed to have some initial value $v^{[0]}$. It is
convenient to assume that this value is written by a write $V^{[0]}$ that
precedes ($\rightarrow$) all other reads and writes of $v$. Eliminating this
assumption changes none of the results, but it complicates the reasoning
because a read that precedes all writes has to be treated as a separate case.

Let $R$ be a read of register $v$, and let

    $I_R = \{V^{[k]} : R \not\dashrightarrow V^{[k]}\}$

    $J_R = \{V^{[k]} : V^{[k]} \dashrightarrow R\}$

It follows from A2 and the assumption that $V^{[0]}$ precedes all reads that
$V^{[0]}$ is in both $I_R$ and $J_R$; and it follows from A2 and A5 that $I_R$
and $J_R$ are finite. The writes in $J_R$ are the ones that could affect $R$.
For the sake of the following intuitive discussion, suppose that A# holds, so
$I_R$ is the set of writes that precede ($\rightarrow$) $R$. (The reader
interested in extending his intuition to the general case should substitute
“precedes” by “effectively precedes”, a concept defined in [5].) The
difference $J_R - I_R$ of these two sets is the set of writes concurrent with
$R$. The read $R$ can observe “traces” of the values written by writes in
$J_R - I_R$, and by the last write in $I_R$. All traces of earlier writes are
assumed to vanish with the completion of the last write in $I_R$, and no write
later than the last one in $J_R$ can influence $R$ in any way.

I will say that $R$ sees $v^{[i,j]}$ if it can observe traces of the writes
$V^{[i]}$ through $V^{[j]}$. The formal definition is as follows:

Definition 4 A read $R$ of register $v$ is said to see $v^{[i,j]}$ where:

    $i \equiv \max\{k : R \not\dashrightarrow V^{[k]}\}$
    $j \equiv \max\{k : V^{[k]} \dashrightarrow R\}$


This definition makes sense because $i$ and $j$ are defined to be the maxima
of finite, nonempty sets: A5 and A2 imply that they are finite, and they
both contain zero. Also observe that B1 implies that $i \le j$.

I can now give the formal definitions of safe, regular, and atomic registers.
A safe register is one in which a read obtains the correct value if it is not
concurrent with any write. This is the case if the read observes traces of only
a single write.

B3. (safe) A read that sees $v^{[i,i]}$ obtains the value $v^{[i]}$.

A regular register is one in which a read obtains a value that it “could have”
seen.

B4. (regular) A read that sees $v^{[i,j]}$ obtains a value $v^{[k]}$ for some
    $k$ with $i \le k \le j$.

An atomic register satisfies the additional requirement that a read is never
concurrent with any write.

B5. (atomic) If a read sees $v^{[i,j]}$ then $i = j$.

A safe register satisfies B1-B3, a regular register satisfies B1-B4 (note that
B4 implies B3), and an atomic register satisfies B1-B5.
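
Definition 4 and axioms B3-B5 can be illustrated in a global-time model, where the relations reduce to comparisons of starting and finishing times by (1). The Python sketch below is an illustration only: the function names and the `domain` parameter (the register's set of possible values, as in B2) are assumptions, and `writes[0]` is assumed to be the initial write that precedes the read.

```python
from typing import Hashable, List, Set, Tuple

Interval = Tuple[float, float]


def sees(read: Interval, writes: List[Interval]) -> Tuple[int, int]:
    """Definition 4 in a global-time model; writes[k] is the interval of V^[k]."""
    s_r, f_r = read
    # "R does not affect V^[k]" means not (s_R <= f_V), i.e. s_R > f_V.
    i = max(k for k, (_, f_w) in enumerate(writes) if s_r > f_w)
    # "V^[k] can affect R" means s_V <= f_R.
    j = max(k for k, (s_w, _) in enumerate(writes) if s_w <= f_r)
    return i, j


def allowed_values(
    read: Interval,
    writes: List[Interval],
    values: List[Hashable],
    domain: Set[Hashable],
    kind: str,
) -> Set[Hashable]:
    """Values a read may obtain under B3 (safe), B4 (regular) or B5 (atomic)."""
    i, j = sees(read, writes)
    if i == j:
        return {values[i]}                # no concurrent write: B3
    if kind == "safe":
        return set(domain)                # only B2 constrains the result
    if kind == "regular":
        return set(values[i : j + 1])     # B4: some v^[k] with i <= k <= j
    raise ValueError("B5 forbids a read that is concurrent with a write")
```

With writes at `(0, 1)`, `(2, 4)` and `(6, 8)`, a read during `(4.5, 5)` sees $v^{[1,1]}$ and must return $v^{[1]}$ for every kind of register, while a read during `(3, 7)` sees $v^{[0,2]}$ and may return any written value if the register is regular, or any value of the domain if it is only safe.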

The following two propositions state some useful properties that are
simple consequences of Definition 4. I introduce the notation of letting
$v^{[i,j]}$ stand for a read that sees the value $v^{[i,j]}$. Thus, part (a) is
an abbreviation for: “If $R$ is a read that sees $v^{[i,j]}$ and
$R \rightarrow V^{[k]}$ then $j < k$.” (Recall that $V^{[k]}$ is the $k$th
write of $v$.)

Proposition 6 (a) If $v^{[i,j]} \rightarrow V^{[k]}$ then $j < k$.

(b) If $V^{[k]} \rightarrow v^{[i,j]}$ then $k \le i$.

(c) If $v^{[i,j]} \rightarrow v^{[i',j']}$ then $j \le i' + 1$.

Proposition 7 If $R$ is a read that sees $v^{[i,j]}$ then

(a) $k \le j$ if and only if $V^{[k]} \dashrightarrow R$.

(b) $i < k$ if and only if $R \dashrightarrow V^{[k]}$.


In a global-time view, atomicity is usually defined to mean that all
operations are instantaneous. In B5, it is defined by the requirement that
a write does not overlap a read. However, two reads may overlap, and a
write could overlap some operation execution that is not a read or write of
the register. It is easy to see that, given a global-time model for a system
execution satisfying B5, without violating conditions B1-B5, we can shrink
the intervals occupied by reads and writes so that they overlap no other
operations. Thus, the original system execution implements one in which
reads and writes of the atomic register are instantaneous.

For a nonatomic register, reads and writes cannot both be made
instantaneous. However, the reads can be made instantaneous.

Proposition 8 Any system execution $\langle S, \rightarrow, \dashrightarrow \rangle$ having a safe or regular
register v trivially implements a system execution $\langle S, \rightarrow', \dashrightarrow' \rangle$ in which v is
also safe or regular, such that $\langle S, \rightarrow', \dashrightarrow' \rangle$ has a global-time model in which
every read of v is instantaneous.

I  have  observed  that  a  regular  register  is  not  necessarily  atomic  because 
two  successive  reads  that  overlap  the  same  write  could  return  the  new  then 
the  old  value.  The  following  result  shows  that  this  is  the  only  way  a  regular 
register  can  fail  to  be  atomic. 

Proposition 9 Let $\langle S, \rightarrow, \dashrightarrow \rangle$ be a system execution containing reads and
writes to a regular register v, and let $\phi$ be an integer-valued function on the
set of reads such that:

1. If R sees $v^{[i,j]}$, then $i \le \phi(R) \le j$.

2. A read R returns the value $v^{[\phi(R)]}$.

3. If $R \rightarrow R'$ then $\phi(R) \le \phi(R')$.

Then $\langle S, \rightarrow, \dashrightarrow \rangle$ trivially implements a system execution in which v is an
atomic register.

A function $\phi$ satisfying the first two properties exists if and only if v is
regular. One might be tempted to replace these three properties with the
requirement that v be regular and the following hold:

3'. If $v^{[i,j]} \rightarrow v^{[i',j']}$ then there exist k and k' with $i \le k \le j$ and
$i' \le k' \le j'$ such that $v^{[i,j]}$ returns the value $v^{[k]}$ and $v^{[i',j']}$ returns the
value $v^{[k']}$.

However, this does not imply atomicity. As a counterexample, let
$v^{[0]} = v^{[2]} = 0$ and $v^{[1]} = 1$, let $R_1$, $R_2$, $R_3$ be the three reads shown in
Figure 1.2, and suppose that $R_1$ and $R_3$ return the value 1 while $R_2$ returns
the value 0. The reader can show that this register is regular, but no such
$\phi$ can be constructed; there is no way to interpret these reads and writes as
belonging to an atomic register while maintaining the given orderings among
the writes and among the reads.

[Figure 1.2: An interesting collection of reads and writes. The figure plots the writes and the reads as intervals along a time axis.]
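
The claim that no suitable function exists can be checked by exhaustive search. The Python sketch below is illustrative only: because the figure itself is not reproduced here, the (i, j) ranges assigned to the three reads are assumptions chosen to be consistent with Proposition 6, and `find_phi` is a hypothetical helper name.

```python
from itertools import product

values = [0, 1, 0]  # values of the three writes: 0, 1, 0

# (i, j, value returned) for three reads in precedence order.
# The (i, j) ranges are assumed, since the figure's exact layout is not available.
reads = [(0, 1, 1), (0, 2, 0), (1, 2, 1)]


def find_phi(reads, values):
    """Search for a function satisfying conditions 1-3 of Proposition 9, else None."""
    ranges = [range(i, j + 1) for i, j, _ in reads]          # condition 1
    for phi in product(*ranges):
        returns_ok = all(values[p] == r[2] for p, r in zip(phi, reads))  # condition 2
        monotone = all(a <= b for a, b in zip(phi, phi[1:]))             # condition 3
        if returns_ok and monotone:
            return phi
    return None


# Every read returns a value written by some write in its range, so B4 holds.
regular = all(any(values[k] == val for k in range(i, j + 1)) for i, j, val in reads)
phi = find_phi(reads, values)

# For contrast, a history in which the middle read returns the new value admits a solution.
phi_ok = find_phi([(0, 1, 0), (0, 2, 1), (1, 2, 1)], values)
```

The first and third reads force the index 1, while the middle read can only be assigned 0 or 2, so no nondecreasing assignment exists.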

If two reads cannot overlap the same write, then $v^{[i,j]} \rightarrow v^{[i',j']}$ implies
$j \le i'$. This implies that any $\phi$ satisfying conditions 1 and 2 of Proposition 9
also satisfies condition 3. But such a $\phi$ exists if v is regular, so any regular
register trivially implements an atomic one if two reads cannot overlap a
single write.

1.3.4  Systems 

I have defined a system execution, but not a system. Formally, a system is
just  a  set  of  system  executions — a  set  that  represents  all  possible  executions 
of  the  system. 

Definition 5 A system is a set of system executions. The system S is
said to contain a register v satisfying one or more of the properties B1-B5
if every system execution in S contains a sequence $V^{[0]} \rightarrow V^{[1]} \rightarrow \cdots$ of writes
with associated values $v^{[0]}, v^{[1]}, \ldots$ and a set of reads satisfying the corresponding
properties.

The  usual  method  of  describing  a  system  is  with  a  program  written  in 
some  programming  language.  Each  execution  of  such  a  program  describes 
a  system  execution,  and  the  program  represents  the  system  consisting  of 
the  set  of  all  such  executions.  The  only  operation  executions  that  concern 
us  are  reads  and  writes  of  a  register;  “calculation”  steps  can  be  ignored. 
For example, execution of the statement x := y ∨ z includes three operation
executions: a read of y, a read of z, and a write of x. It does not matter
whether or not the computation of the ∨ is considered to be a separate
operation execution. What is significant is that each of the two reads
precedes ($\rightarrow$) the write; no precedence relation is assumed between the
two reads.
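
A minimal way to picture this is to write the operation executions of the example statement as a set and the precedence relation as a set of ordered pairs. The sketch below is illustrative only; the operation names and the helper `ordered` are hypothetical.

```python
# Operation executions of the example statement (names are illustrative).
operations = {"read_y", "read_z", "write_x"}

# The precedence relation as a set of (earlier, later) pairs.
precedes = {("read_y", "write_x"), ("read_z", "write_x")}


def ordered(a: str, b: str) -> bool:
    """True if one of the two operation executions precedes the other."""
    return (a, b) in precedes or (b, a) in precedes


reads_unordered = not ordered("read_y", "read_z")
```

Leaving the two reads unordered means that any system execution in which they happen in either order, or concurrently, is an execution of the statement.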

A formal semantics for a programming language can be given by defining,
for each syntactically correct program, the set of all possible executions.
This is done by recursively defining a succession of lower and lower
higher-level views, in which each operation execution represents a single execution
of a syntactic program unit.^ At the highest-level view, a system execution
consists of a single operation execution that represents an execution of the
entire program. A view in which an execution of the statement S;T is a
single operation execution is refined into one in which an execution consists
of an execution of S followed by ($\rightarrow$) an execution of T.* While this
kind of formal semantics may be useful in studying subtle programming
language issues, it is unnecessary for the simple language constructs used
in the algorithms of this paper, so I will just employ these ideas informally.

Having  defined  what  a  system  is,  I  should  define  what  it  means  for  one 
system  to  implement  another.  The  definition  is,  of  course,  in  terms  of  the 
definition  of  what  it  means  for  one  system  execution  to  implement  another. 

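Definition 6 The system S implements a system H if there is a mapping
$\iota : S \mapsto H$ such that, for every system execution $\langle S, \rightarrow, \dashrightarrow \rangle$ in S,
$\langle S, \rightarrow, \dashrightarrow \rangle$ implements $\iota(\langle S, \rightarrow, \dashrightarrow \rangle)$.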

Note that for S to implement H, every execution of S must correspond
to some execution of H. The converse is not required; I do not insist that
every possible execution of H have a corresponding implementation. A
higher-level description H of a system can be viewed as a specification of
its implementation: a specification that describes all allowed behaviors,
but does not require any particular behavior.

^For nonterminating programs, the formalism must be extended to allow a nonterminating
higher-level operation execution that consists of an infinite set of lower-level operation
executions.

*In the general case, we must also allow the possibility that an execution of S;T consists
of a nonterminating execution of S.

This  definition  raises  the  question  of  how  we  can  specify  that  the  system 
must  actually  do  anything.  The  specification  of  a  banking  system  must 
allow  a  possible  system  execution  in  which  no  customers  happen  to  use  an 
automatic  teller  machine  on  a  particular  afternoon,  and  it  must  include  the 
possibility  that  a  customer  will  enter  an  invalid  request.  How  can  we  rule 
out  an  implementation  in  which  the  machine  simply  ignores  all  customer 
requests  during  an  afternoon,  or  interprets  any  request  as  an  invalid  one? 

The answer lies in the concept of an interface specification, discussed
in [8]. The specification must explicitly describe how certain interface
operations are to be implemented; their implementation is not left to the
implementor. The interface specification for the bank includes a description
of what sequences of keystrokes at the teller machine constitute valid
requests, and the set of system executions only includes ones in which every
valid request is serviced. What it means for someone to use the machine
is part of the interface specification, so the possibility of no one using the
machine on some afternoon does not allow the implementation to ignore
someone who does use it.

Since this paper considers only the internal operations that effect
communication between processes within the system, not the interface
operations that effect communication between the system and its environment,
I will ignore interface specifications. The interested reader is referred to [8]
for a discussion of this subject.
1.4 Correctness Proofs for the Constructions


1.4.1  Proof  of  Constructions  1,  2  and  3 

These  constructions  are  all  simple,  and  the  correctness  proofs  are  essentially 
trivial.  Formal  proofs  add  no  further  insight  into  the  constructions,  but 
they  do  illustrate  how  the  formalism  developed  in  the  preceding  section  is 
applied  to  actual  algorithms.  I  therefore  indicate  all  the  formal  details  in  the 
proof  of  Construction  1.  The  formal  proofs  for  the  other  two  constructions 
are  just  briefly  sketched. 

Recall that in Construction 1, the m-reader register v is implemented
by the m single-reader registers $v_1, \ldots, v_m$. Formally, this construction defines
a system, which I denote by S, that is the set of all system executions
consisting of reads and writes of the $v_i$ such that the only operations to
these registers are the ones indicated by the readers' and writer's programs.
Thus, S consists of all system executions $\langle S, \rightarrow, \dashrightarrow \rangle$ such that:

• S consists of reads and writes of the registers $v_i$.

• Each $v_i$ is written by the same writer and is read only by the $i$th
reader.

• For any i and j, if the write $V_i^{[k]}$ occurs then the write $V_j^{[k]}$ also occurs,
and $V_i^{[k]} \rightarrow V_j^{[k+1]}$.


The third condition expresses the formal semantics of the writer's
algorithm, asserting that a write of v is done by writing all the $v_i$, and that a
write of v is completed before the next one is begun.
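
The construction itself can be pictured with a short sequential sketch. This is only an illustration of the data layout described above, with a hypothetical class name and 0-based reader indices; it does not model concurrency, which is what the formal proof is about.

```python
class MultiReaderRegister:
    """Sequential sketch of Construction 1: one single-reader register per reader."""

    def __init__(self, m: int, initial: int) -> None:
        self.v = [initial] * m  # v[i] is written by the writer and read only by reader i

    def write(self, value: int) -> None:
        # A write of v is done by writing every v_i; the method returns only
        # after all of them are written, so one write completes before the next begins.
        for i in range(len(self.v)):
            self.v[i] = value

    def read(self, reader: int) -> int:
        # Reader i reads only its own register v_i.
        return self.v[reader]


reg = MultiReaderRegister(3, 0)
reg.write(5)
seen = [reg.read(i) for i in range(3)]
```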

To say that the $v_i$ are safe or regular means that the system S is further
restricted to contain only system executions that satisfy B1-B3 or B1-B4,
when each $v_i$ is substituted for v in those conditions.

To show that this construction implements a register v, Definition 6
states that we must construct a mapping $\iota$ from S to the system H, which
consists of the set of all system executions formed by reads and writes to an
m-reader register v. To say that v is safe or regular means that H contains
only system executions satisfying B1-B3 or B1-B4.
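In giving the readers' and writer's algorithms, the construction implies
that for each system execution $\langle S, \rightarrow, \dashrightarrow \rangle$ of S, the set $\iota(S)$ of operation
executions of $\iota(\langle S, \rightarrow, \dashrightarrow \rangle)$ is the higher-level view of $\langle S, \rightarrow, \dashrightarrow \rangle$ consisting
of all writes $V^{[k]}$ of the form $\{V_1^{[k]}, \ldots, V_m^{[k]}\}$, for $V_i^{[k]} \in S$, and all reads of
the form $\{R_i\}$, where $R_i \in S$ is a read of $v_i$. (The write $V^{[k]}$ exists in $\iota(S)$
if and only if some, and hence all, $V_i^{[k]}$ exists.) Conditions H1 and H2 are
obviously satisfied, so this is indeed a higher-level view. To complete the
mapping $\iota$, we must define the precedence relations $\stackrel{H}{\rightarrow}$ and $\stackrel{H}{\dashrightarrow}$ so that
$\iota(\langle S, \rightarrow, \dashrightarrow \rangle)$ is defined to be $\langle \iota(S), \stackrel{H}{\rightarrow}, \stackrel{H}{\dashrightarrow} \rangle$. Proving the correctness of the
construction means showing that:

1. $\langle \iota(S), \stackrel{H}{\rightarrow}, \stackrel{H}{\dashrightarrow} \rangle$ is a system execution; that is, it satisfies A1-A5.

2. $\langle S, \rightarrow, \dashrightarrow \rangle$ implements $\langle \iota(S), \stackrel{H}{\rightarrow}, \stackrel{H}{\dashrightarrow} \rangle$; that is, H1-H3 are satisfied.

3. $\langle \iota(S), \stackrel{H}{\rightarrow}, \stackrel{H}{\dashrightarrow} \rangle$ is in H; that is, B1-B3 or B1-B4 are satisfied.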

The precedence relations on $\iota(S)$ are defined to be the “real” ones, with
$G \stackrel{H}{\rightarrow} H$ if and only if G really precedes H. Formally, this means that we
let $\stackrel{H}{\rightarrow}$ and $\stackrel{H}{\dashrightarrow}$ be the induced relations $\rightarrow^*$ and $\dashrightarrow^*$, defined by (3). Recall
from Section 3.2 that the induced precedence relations make any higher-level
view a system execution, so 1 is satisfied. I have already observed that
H1 and H2, which are independent of the choice of precedence relations, are
satisfied, and H3 is trivially satisfied by the induced precedence relations,
so 2 holds. Therefore, we need only show that if B1-B3 or B1-B4 are
satisfied for reads and writes of each of the registers $v_i$ in $\langle S, \rightarrow, \dashrightarrow \rangle$, then
they are also satisfied by the register v of $\langle \iota(S), \stackrel{H}{\rightarrow}, \stackrel{H}{\dashrightarrow} \rangle$.

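Property B1 for $\langle \iota(S), \stackrel{H}{\rightarrow}, \stackrel{H}{\dashrightarrow} \rangle$ follows easily from (3) and property B1
for $\langle S, \rightarrow, \dashrightarrow \rangle$. Property B2 is immediate. The informal proof of B3 is as
follows: if a read of v by process i does not overlap a write (in $\iota(S)$), then
the read of $v_i$ does not overlap any write of $v_i$, so it obtains the correct
value. A formal proof is based upon:

X. If a read $R_i$ in $\langle S, \rightarrow, \dashrightarrow \rangle$ sees $v_i^{[k,l]}$ then the corresponding read $\{R_i\}$
in $\langle \iota(S), \stackrel{H}{\rightarrow}, \stackrel{H}{\dashrightarrow} \rangle$ sees $v^{[k',l']}$, where $k' \le k \le l \le l'$.

The proof of X is a straightforward application of (3) and Definition 4.
Property X easily implies that if B3 or B4 holds for $\langle S, \rightarrow, \dashrightarrow \rangle$ then it
holds for $\langle \iota(S), \stackrel{H}{\rightarrow}, \stackrel{H}{\dashrightarrow} \rangle$. This completes the formal proof of Construction 1.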

The formal proof of Construction 2 is quite similar. Again, the induced
precedence relations are used to turn a higher-level view into a system
execution. The proof of Construction 3 is a bit trickier because a write
operation to v* that does not change its value consists only of the read
operation to the internal variable x. This means that the induced precedence
relations do not necessarily satisfy B1; they must be extended to
make B1 hold. This can be done by applying Proposition 3, though a more
“economical” extension can also be constructed.

1.4.2  Proof  of  Construction  4 

The higher-level system execution of reads and writes to v is defined to
have the induced precedence relations $\rightarrow^*$ and $\dashrightarrow^*$. As in the above proofs,
verifying that this defines an implementation and that B1 holds is trivial.
The only problems are proving B2 (namely, showing that the reader must
find some $v_i$ equal to one) and proving B4 (which implies B3).

I  first  prove  the  following  property: 

Y. If a read returns the value $\mu$, then there is some k such that $v^{[k]} = \mu$
and the read sees $v^{[l,r]}$ with $l \le k \le r$.

If B2 holds, then property Y implies B4.

Reasoning about the construction is complicated by the fact that a
write of v does not write all the $v_j$, so the write of $v_j$ that occurs during
the $k$th write of v is not necessarily the $k$th write of $v_j$. To overcome this
difficulty, I introduce new names for the write operations to the $v_j$. If $v_j$ is
written during the execution of $V^{[k]}$, then I let $W_j^{[k]}$ denote that write of $v_j$;
otherwise, $W_j^{[k]}$ is undefined. Thus, every write $V_j^{[l]}$ of $v_j$ is also named $W_j^{[l']}$
for some $l' \ge l$. I will say that a read of $v_j$ sees $w_j^{[l',r']}$ if it sees $v_j^{[l,r]}$ and the
writes $W_j^{[l']}$ and $W_j^{[r']}$ are the same writes as $V_j^{[l]}$ and $V_j^{[r]}$, respectively.
Note that, because the writer's algorithm writes from “right to left”, if $W_i^{[k]}$
exists, then so do all the $W_j^{[k]}$ with $j < i$. In particular, $W_1^{[k]}$ exists for all
k.
Let R be a read that returns the value $\mu$, and let $\mu$ be the $i$th value, so
R consists of the sequence of reads $R_1 \rightarrow \cdots \rightarrow R_i$, where each $R_j$ is a
read of $v_j$. All the $R_j$ return the value 0 except $R_i$, which returns the value
1. Let R see $v^{[l,r]}$ and let each $R_j$ see $w_j^{[l(j),r(j)]}$. By regularity of $v_j$, there
is some $k(j)$ with $l(j) \le k(j) \le r(j)$ such that $W_i^{[k(i)]}$ writes a 1 and $W_j^{[k(j)]}$
writes a 0 for $1 \le j < i$. Thus, $v^{[k(i)]}$ is the value read by R, so it suffices
to show that $l \le k(i) \le r$.

Definition 4 implies $W_i^{[r(i)]} \dashrightarrow R_i$, which by (3) implies $V^{[r(i)]} \dashrightarrow^* R$,
which implies $r(i) \le r$. Hence, $k(i) \le r$.

For any p with $p \le l$, Definition 4 implies that $R \not\dashrightarrow^* V^{[p]}$, which implies
that $R_1 \not\dashrightarrow W_1^{[p]}$, which in turn implies that $p \le l(1)$. Hence, $l \le l(1)$.*
Since $l(j) \le k(j)$, it suffices to prove that $k(j) \le l(j+1)$ for $1 \le j < i$.

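Since $k(j) \le r(j)$, Definition 4 implies that $W_j^{[k(j)]} \dashrightarrow R_j$. Because
$W_j^{[k(j)]}$ writes a zero, $W_{j+1}^{[k(j)]}$ exists, and we have

$W_{j+1}^{[k(j)]} \rightarrow W_j^{[k(j)]} \dashrightarrow R_j \rightarrow R_{j+1}$

where the two $\rightarrow$ relations are implied by the order in which writing
and reading of the individual $v_j$ are performed. By A4, this implies that
$W_{j+1}^{[k(j)]} \rightarrow R_{j+1}$, which, by A2, implies $R_{j+1} \not\dashrightarrow W_{j+1}^{[k(j)]}$. By Definition 4,
this implies that $k(j) \le l(j+1)$, completing the proof of property Y.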

To complete the proof of the construction, I must only prove that every
read does return a value. Let R and the values $l(j)$, $k(j)$, and $r(j)$ be as
above, except let i = n and drop the assumption that $R_i$ obtains the value
1. To prove B2, I must prove that $R_n$ does obtain the value 1.

The same argument used above shows that if $R_j$ obtains a zero, then
that zero was written by some write $W_j^{[k(j)]}$, which implies that $W_{j+1}^{[k(j)]}$ exists
and $k(j) \le l(j+1)$. Since $R_n$ obtains the value written by $W_n^{[k(n)]}$, it must
obtain a 1 unless $k(n) = 0$ and the initial value is not the $n$th one. Suppose
the initial value is the $p$th value, encoded with $v_p = 1$, $p < n$. Since $R_p$
obtains the value 0, we must have $k(p) > 0$, which implies that $k(n) > 0$,
so $R_n$ obtains the value 1. This completes the proof of the construction.
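
For orientation, the unary encoding that this proof reasons about can be sketched sequentially in Python. The sketch assumes, consistently with the “right to left” remark above, that a write of the $i$th value first sets $v_i$ to 1 and then clears $v_{i-1}, \ldots, v_1$, and that a read scans $v_1, v_2, \ldots$ until it finds a 1. The class name is hypothetical and concurrency is not modeled.

```python
class UnaryRegister:
    """Sequential sketch of an n-valued register built from n binary registers."""

    def __init__(self, n: int, initial: int) -> None:
        # Values are 1..n; bits[j - 1] models v_j. Initially only v_initial is 1.
        self.bits = [0] * n
        self.bits[initial - 1] = 1

    def write(self, value: int) -> None:
        self.bits[value - 1] = 1            # first set v_value to 1
        for j in range(value - 1, 0, -1):   # then clear v_{value-1}, ..., v_1
            self.bits[j - 1] = 0

    def read(self) -> int:
        for j, bit in enumerate(self.bits, start=1):  # read v_1, v_2, ... in order
            if bit == 1:
                return j
        raise AssertionError("no v_j equal to 1; B2 would be violated")


r = UnaryRegister(4, 2)
history = [r.read()]
for value in (3, 1, 4):
    r.write(value)
    history.append(r.read())
```

Note that a write never clears bits to the right of the one it sets, so stale 1s may remain there; the reader never reaches them because it stops at the first 1.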

*Note that the same argument does not prove that $l \le l(i)$ because $W_i^{[l]}$ does not
necessarily exist.
1.4.3 Proof of Construction 5

This construction defines a set U, consisting of reads and writes of v*, that
is a higher-level view of a system execution $\langle S, \rightarrow, \dashrightarrow \rangle$ whose operation
executions are reads and writes of the two shared registers v,cw and cr. As
usual, $\rightarrow^*$ and $\dashrightarrow^*$ denote the induced precedence relations on S that are
defined by (3).

Let u denote the shared register v,cw of the algorithm. In this
construction, the write $V^{*[k]}$ of v*, for k > 0, is implemented by a read of cr
followed by the writes $U^{[2k-1]} \rightarrow U^{[2k]}$, where $U^{[l]}$ is the $l$th write of
u. The initial write of v* is just the initial write of u.

Since there is only one reader, the reads of v* are totally ordered by
$\rightarrow$. The $i$th read $S_i$ of v* consists of the sequence $R_i \rightarrow CR^{[i]}$, where $R_i$ is
the $i$th read of u and $CR^{[i]}$ is the $i$th write of cr. For notational convenience,
I assume an imaginary read $R_0$ of u that returns the value $u^{[0]}$, and I define
$S_0$ to be the sequence of operations $R_0 \rightarrow CR^{[0]}$. The operation $S_0$ is
taken to be the one that sets the initial values of x' and cr'.

The proof of correctness is based upon Proposition 9. Letting $\phi(i)$
denote $\phi(S_i)$, to apply that proposition, it suffices to choose the $\phi(i)$ such
that the following three properties hold:

1. $S_i$ returns the value $v^{*[\phi(i)]}$.

2. If $S_i$ sees $v^{*[l,r]}$ then $l \le \phi(i) \le r$.

3. If $j < i$ then $\phi(j) \le \phi(i)$.

I start by defining a function $\psi$ such that $R_i$ returns the value $u^{[\psi(i)]}$ and,
if $R_i$ sees $u^{[l,r]}$, then $l \le \psi(i) \le r$. Since u is regular, such a $\psi$ exists.
Proposition 6 implies:

Z1. If $j < i$ then $\psi(j) \le \psi(i) + 1$.

By Proposition 7, $U^{[\psi(i)]} \dashrightarrow R_i \dashrightarrow U^{[\psi(i)+1]}$. Suppose $\psi(i) = 2k$. Since
$U^{[2k]}$ is part of $V^{*[k]}$, $U^{[2k+1]}$ is part of $V^{*[k+1]}$, and $R_i$ is part of $S_i$, this
implies $V^{*[k]} \dashrightarrow^* S_i \dashrightarrow^* V^{*[k+1]}$. Hence, property 2 is satisfied if $\phi(i) = k$.

Next, suppose that $\psi(i) = 2k - 1$, where k > 0. Since $U^{[2k-1]}$ is part of
$V^{*[k]}$, we have $V^{*[k]} \dashrightarrow^* S_i \dashrightarrow^* V^{*[k+1]}$, so property 2 is satisfied
if $\phi(i) = k$. But we also have $U^{[2k-2]} \dashrightarrow R_i \dashrightarrow U^{[2k]}$, so property 2 is
also satisfied if $\phi(i) = k - 1$. To summarize, property 2 is satisfied by i if
the following holds:

Z2. (a) If $\psi(i) = 2k$ then $\phi(i) = k$.

(b) If $\psi(i) = 2k - 1$ then $\phi(i) = k$ or $\phi(i) = k - 1$.
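
The arithmetic of Z2 is easy to misread, so here is a small Python sketch of it. The helper name `allowed_phi` is hypothetical; it simply lists the values that Z2 permits for the higher-level index given the lower-level one.

```python
from typing import List


def allowed_phi(psi: int) -> List[int]:
    """Values permitted by Z2 for the higher-level index, given psi >= 0."""
    if psi % 2 == 0:
        return [psi // 2]        # psi = 2k      gives  k
    k = (psi + 1) // 2
    return [k - 1, k]            # psi = 2k - 1  gives  k - 1 or k


table = {psi: allowed_phi(psi) for psi in range(6)}
```

An even index leaves no choice, while an odd index leaves exactly the two candidates between which the rest of the proof must decide.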

The  second  statement  in  the  algorithm  of  Figure  1  consists  of  nested 
if  statements,  so  executing  it  executes  exactly  one  innermost  then  or  else 
clause.  I  will  use  a  sequence  of  t  (for  then)  and  e  (for  else)  characters 
to  denote  such  an  innermost  clause;  for  example,  tee  denotes  the  second 
innermost  else  clause,  which  is  executed  if  Xi  7^  xj  and  x[  =  x!,  =  xo. 

Let a ttt-read be one that executes the ttt clause of the reader's
algorithm, and let a nice read be one that is not a ttt-read. The initial read $S_0$
is defined to be nice. For any i > 0, let $\pi(i)$ denote the largest integer such
that $\pi(i) < i$ and $S_{\pi(i)}$ is nice. In other words, $S_{\pi(i)}$ is the last nice read
before $S_i$. A ttt-read does not change the value of rtn, x', or cr'. Therefore,
when the execution of $S_i$ begins, rtn has the value returned by $S_{\pi(i)}$ and
x',cr' has the value read by $R_{\pi(i)}$.

I first define $\phi(i)$ inductively for all nice reads, starting with $\phi(0) = 0$.
The definition will be made so that Z2 holds for all i. Let i be a
nice read, i > 0, and assume that properties 1-3 and Z2 hold with $\pi(i)$
substituted for i. In the following discussion, I will refer to the values
of variables immediately after the execution of the first statement in the
reader's algorithm during the operation execution $S_i$. Thus, x, cr is the
value read by $R_i$, rtn is the value returned by $S_{\pi(i)}$, and
x',cr' is the value read by $R_{\pi(i)}$.

Consider  first  the  case  V’(i)  =  2k  —  1.  In  this  case,  Xi  =  and 

X2  =  If  xi  ^  X2,  then  properties  1  and  Z2  are  satisfied  only  by 

defining  6(i)  to  equal  k  —  I  \f  5,-  returns  the  value  Xj  and  to  equal  k  if  5, 
returns  the  value  Xj.  In  other  words,  </»(i)  equals  k  if  5,-  executes  the  tet 
clause  and  equals  t  —  1  otherwise.  Since  Z2  is  satisfied,  property  2  holds. 

To  prove  property  3  for  1,  it  suffices  to  prove  that  0(;r(i))  <  6{i),  since 
property  3  is  assumed  to  hold  for  7r(»).  Property  Zl  implies  that  '6{6{^))  < 
2k,  so  Z2  implies  that  <^(7r(i))  can  be  greater  than  0(i)  only  in  two  cases: 
(i)  and  0(i)  =  A:  -  1,  or  (ii)  =2k-l,  0(7r(i))  =  k,  and 
$\phi(i) = k - 1$. But $\psi(\pi(i)) = 2k$ implies that x[ = zi = Xo, so $S_i$ executes the
tet  clause  and  4>{i)  =  k.  Hence,  case  (i)  is  impossible.  If  ip{n(i))  —  2k  —  I 
and  =  k,  then  x*  =  z  and  S^i,)  executes  the  tet  clause,  so  rtn'  =  xL 
Hence, $S_i$ must also execute the tet clause, so $\phi(i) = k$, showing that
case (ii) is impossible. This completes the case $\psi(i) = 2k - 1$ and $x_1 \ne x_2$.

If $\psi(i) = 2k - 1$ and $x_1 = x_2$, then I define $\phi(i)$ to be the maximum
of $k - 1$ and $\phi(\pi(i))$. Z1 and Z2 (for $\pi(i)$) imply that $\phi(\pi(i)) \le k$, so this
defines $\phi(i)$ to equal either $k - 1$ or k. At this point, I note the following
property for later use:

Z3. If $\psi(i) = 2k - 1$, $x_1 = x_2$, and $\phi(i) = k$, then there is a nice read $R_j$
with $j < i$ such that $\psi(j) = 2k$.

The proof of Z3 is by induction on i. The hypothesis, Z1 and Z2 imply that
either $\psi(\pi(i)) = 2k$, in which case we can let $j = \pi(i)$, or else $\psi(\pi(i)) =
2k - 1$ and $\phi(\pi(i)) = k$, in which case we apply Z3 with $\pi(i)$ substituted
for i.

Returning to the definition of $\phi(i)$, in the case under consideration
($\psi(i) = 2k - 1$ and $x_1 = x_2$), properties 1, 2, and Z2 are satisfied because
$\phi(i)$ equals either $k - 1$ or k. Moreover, we obviously have $\phi(\pi(i)) \le \phi(i)$,
so property 3 is also satisfied. This completes the case $\psi(i) = 2k - 1$ and
$x_1 = x_2$.

Finally, I consider the case $\psi(i) = 2k$, where $\phi(i)$ must be defined to
equal k to satisfy Z2. In this case, $x_1 = x_2 = v^{*[k]}$ and $S_i$ executes the
tte clause, returning the value $x_1$. (Since $S_i$ is assumed to be nice, it does
not execute the ttt clause.) Hence, property 1 is satisfied. Since Z2 holds,
property 2 is satisfied. To prove property 3 for i, it suffices to show that
$\phi(\pi(i)) \le \phi(i)$, since the property holds for $\pi(i)$. By Z1, $\psi(\pi(i)) \le 2k + 1$, so
$\phi(\pi(i))$ can be greater than $\phi(i)$ only if $\psi(\pi(i)) = 2k + 1$ and $\phi(\pi(i)) = k + 1$.
There are two possibilities to consider: (i) $x_1' \ne x_2'$, and (ii) $x_1' = x_2'$. In
case  (i).  d{n{i))  can  equal  A:  +  1  only  if  5,(,)  executes  the  tet  clause,  which 
implies  that  x[  ^  X;  and  rtn  =  Xn',  but  this  is  impossible  since  5,-  executes 
the tte clause. In case (ii), Z3 implies that if $\phi(\pi(i)) = k + 1$, then there
exists $j < \pi(i)$ with $\psi(j) = 2k + 2$. But Z1 implies that this is impossible,
since $j < i$ and $\psi(i) = 2k$. Hence, property 3 holds. This completes the
construction of $\phi(i)$ for all nice reads $S_i$.
To complete the definition of $\phi$, if $S_i$ is a ttt-read, I define $\phi(i)$ to equal
$\phi(\pi(i))$. Since $S_i$ returns the same value as $S_{\pi(i)}$, property 1 is satisfied.
Property 3 obviously holds, since it holds for nice reads and $\phi$ assigns to
every ttt-read the same value as it assigns the most recent nice read. The
only thing left to prove is that property 2 holds for a ttt-read $S_i$. This is
perhaps the most subtle proof of the entire paper. It involves proving the
remark  made  earlier,  that  if  a  sequence  of  reads  obtains  the  values 
(u,  fi),  and  (u,  i/),  all  of  the  same  color,  then  the  last  read  overlaps  the  write 
o{{u,fi). 

Let  Si  be  a  ttt-read,  and  let  (/i,  /i),  c  be  the  value  read  by  Ri.  Since 
Si  executes  the  ttt  clause,  x',cr',  which  is  the  value  j-g^d  by  Rr{i), 

must  equal  {i/,  fi),  c  for  some  u  ^  fi,  so  ip(n{i))  is  odd.  Let  'ip[Tt(i))  =  2A:—  1. 
Since  S,-  executes  the  ttt  clause,  S„(i)  must  return  fi,  so  it  must  execute  the 
tet  clause.  This  implies  that  (fi(7r(i))  =  k,  so<^(j)  =  k,  and  that  the  value  of 
cw  read  by  the  operation  execution  5,r(.)-!  must  also  equal  r.  so 
writes the value c. The following operation executions must therefore be
performed in sequence by the reader (each one $\rightarrow$'s the next, but the
reader may perform other, intervening operation executions):

•  writes  cr[;r(j)  —  1]  =  c 

•  Rr{i)'  reads  =  (i/,fi),c 

•  Ri'.  reads  c 

•  writes  crl‘1  =  c 

Moreover,  the  reads  between  S^ii)  and  Si  also  write  the  value  c  in  cr. 
Therefore,  crbl  =  c  for  all  j  with  7r(j)  —  1  <  ;  <  i-  Note  also  that 
0(i)  =  <f>(Tr{i))  =  k  -  1. 

It  follows  from  Zl  that  ip(i)  >  2k— 2.  If  =  2k— 2,  then  Proposition  7 
implies  that  Ri  -  However,  that  proposition  also  implies  that 

Since  aii(j  we  see  that 

U[zk-2\  — ,  i2,.  ...  This  implies  -1.  Si-*- ^  K’l''!.  Since 

<^(i)  =  k  —  1,  property  2  follows  from  Proposition  7. 

I  have  shown  that  ip(i)  >2k  —  2  and  property  2  holds  if  =  2k  —  2. 
To  finish  the  proof,  I  now  show  that  ^(»)  =  2^  —  2  by  assuming  i>{i)  > 
$2k - 2$ and obtaining a contradiction. Since equals {p>,n),c and

equals  neither  of  which  equals  (because  fi  ^  u),  we  must  have 

ti’(i)  >  2!:.  Let  denote  the  read  of  cr  in  the  write  of  r*  of  which 

is  a  part.  Since  sets  cw  to  c,  the  read  crl'-’’!  must  obtain  the  value 

$\neg c$. The writer must therefore perform the following sequence of operation
executions, where each $\rightarrow$'s the next. (There may be other, intervening
operation executions.)

•  writes  =  (/i,/i),c 

•  reads  the  value  -<c 

•  f,dv(«)l;  writes  =  (i/,u),c 

By  Proposition  7  (and  the  definition  of  tjj),  We  therefore 

have 

^  cft'-'l 

so  — ►  crl'  ’’!.  By  part  (b)  of  Proposition  6,  this  implies  n(i)  —  1  < 

1. 

Proposition  7  implies  A;-,  so 

crl‘-d  — .  ---  A,-  — ►  CAI’T 

This  implies  — ►  CR^'\  so  part  (a)  of  Proposition  6  implies  r  <  i.  We 
therefore  have  rp{i)  —  1  <  /  <  r  <  i,  so  regularity  of  cr  implies  that  crl''''l 
obtains  a  value  crld  with  ^ll(i)  —  1  <  ;  <  »•  However,  I  already  observed 
that  all  such  values  equal  c,  and  crl'’’’!  obtains  the  value  -ic.  This  is  the 
required  contradiction,  completing  the  proof. 
2.1 Abstract


This section of the report proposes a new array processor architecture that
is

• Effective for arbitrary programs that cannot be mapped onto regular
array structures and that, consequently, perform poorly on existing
array processors

• Capable of operation in a fault-tolerant mode

• Physically structured to permit high-performance VHSIC implementation.


2.2  Background 


There is no need to enumerate the problems for which our current
high-performance computers are inadequate; the list would be endless. Moreover,
there are many important problems for which our current computers
are several orders of magnitude too slow. Remarkable as have been the
improvements in computer performance over the past 40 years, there is
nonetheless no possibility that the undoubted continued increases in
performance will suffice to meet our future needs.

The improved performance of conventional von Neumann computers has
been due largely to improved electronic-component technology that allows
faster clock cycles and the use of more complex, faster circuits, as well as to
improved designs that permit operations to be executed in fewer cycles and
several operations to be performed concurrently. While further improvements
in electronics can be expected, there are very real limitations on the
extent to which increased concurrency is possible while still maintaining
the von Neumann illusion of purely sequential operation.

Even  better  performance  has  been  achieved  by  making  the  processors 
more  specialized  in  structure.  Two  primary  examples  are  the  vector  proces¬ 
sors,  such  as  the  Cray  and  Star  computers,  which  are  very  effective  for  pro¬ 
cessing  large  matrices  uniformly,  and  the  systolic  processors,  which  are  very 
effective  for  FFT,  convolution,  and  similar  signal-processing  applications. 
Unfortunately,  there  are  many  applications  that  do  not  lend  themselves  to 
such  specialized  processing  strategies.  For  such  applications,  only  parallel 
processing  of  the  problem  by  many  cooperating  processors  (whether  von 
Neumann  or  not)  can  result  in  substantially  faster  processing. 

Not  only  do  improvements  in  electronic-component  technology  allow  the 
construction  of  very-high-performance  circuitry,  but  they  also  permit  uni¬ 
form  replication  of  relatively  simple  electronic  circuits  at  very  low  cost.  It  is 
clear  that  the  ideal  structure  for  VLSI  implementation  of  a  multiprocessor 
architecture  consists  of  a  regular  array  of  processors. 


Past  experience  in  using  array  processors,  however,  has  not  been  very 
encouraging. The prototypical Illiac IV computer and the generally
similar Intel Hypercube have been shown to be effective only when
communication within the array is almost entirely between adjacent
processors. Communication between arbitrary processors requires that the
data be passed via a chain of intermediary processors, which is slow and
absorbs an excessive amount of system resources. Even for suitable
problems, it has been
found  to  be  difficult  to  map  the  problem  domain  onto  the  array  so  as  to 
obtain  reasonable  efficiency;  moreover,  the  approach  appears  to  be  almost 
completely  ineffective  for  less  suitable  problems. 

Another class of array processors is the SIMD (single-instruction
multiple-data) machines, exemplified by the Connection Machine. SIMD
machines are very effective whenever substantially the same sequence of
operations must be applied to a large proportion of the cells of the array.
The Connection Machine, which, with its one-bit processors, is highly suited
to image-processing applications, has been used with great ingenuity in a
number  of  applications.  But  many  applications  require  a  significant  pro¬ 
portion  of  special-case  processing  and  are  not  implemented  efficiently  in  an 
SIMD  architecture.  Furthermore,  the  problem  of  communication  between 
arbitrary  processors  in  the  array  is  still  significant. 

More  general  are  the  MIMD  (multiple-instruction  multiple-data)  ma¬ 
chines,  exemplified  by  the  Butterfly  Machine.  Such  machines  contain  a  set 
of processors and a set of shared storage modules, the processors
accessing the memory through an array of delta switches. The Butterfly Machine
also  provides  a  direct  link  between  each  processor  and  its  associated  storage 
module.  Access  to  the  memory  by  means  of  delta  switches  maintains  the 
appearance  of  a  single  uniform  memory,  equally  accessible  to  all  processors, 
but  involves  contention  among  processors  for  use  of  the  switches,  and  thus 
substantially  increases  memory  access  time.  Consequently,  efficient  use  of 
the  machine  requires  that  most  of  a  processor’s  memory  references  be  made 
to  its  own  associated  module;  this  results  in  allocation  problems  similar  to, 
but  less  critical  than,  those  encountered  with  the  Illiac  IV  type  machines. 

The problem of mapping an application onto an array is greatly
aggravated by the presence of faulty processors. In any large array, it is inevitable
that  there  will  be  processors  that  have  failed.  Allowance  must  also  be  made 
for  transient  faults,  which  are  much  more  frequent  than  solid  faults  and  may 
cause  a  significant  rate  of  erroneous  results  from  a  large  array.  The  use  of 
VLSI  with  very  small  device  dimensions,  as  might  be  expected  in  an  array 
implementation,  inevitably  increases  the  rate  of  transient  faults. 

Errors resulting from faults must be detected and corrected so as to
protect the validity of the results. While it is conceivable that adequate
protection against solid faults could be provided (at least for batch
processing) through some form of periodic testing of the processors, detection
of  errors  resulting  from  transient  faults  necessarily  requires  some  form  of 
replication  and  comparison  for  all  processing  operations. 

The  presence  of  a  faulty  processor  requires  that  either 

•  The  error  detection  and  correction  algorithms  be  strong  enough  to 
mask  the  faulty  processors  continuously  (e.g.  by  majority  voting),  and 
that  repair  be  rapid  enough  to  reduce  the  rate  of  multiple  concurrent 
faults  to  an  acceptable  level,  or 

•  The  mapping  of  the  application  onto  the  array  be  modified  to  avoid 
having  to  use  the  faulty  processor. 

Mapping  the  application  onto  an  array  can  be  quite  difficult  even  in  the 
absence  of  faulty  processors,  and  is  certainly  not  simplified  by  introducing 
irregularities  into  the  array  structure.  Local  adaptation,  such  as  transfer¬ 
ring  the  workload  of  the  faulty  processor  onto  neighboring  processors,  may 
overload the processors and increase communication delays. Global
adaptation, if possible, will involve moving large amounts of data to
accommodate the revised mapping of the application onto the array. The
difficulties of adaptive reconfiguration suggest that continuous error
detection and correction, which is also effective against transient faults,
may be preferable
in  many  circumstances. 

Of course, there are some applications in which most of the calculation
comprises a search and for which a comparatively short check can be made
at the end to confirm the validity of the solution. For such calculations, fault
tolerance may be less essential. There are also some applications for which
the rate of processor failure may be substantial and, in addition, immediate
recovery from error is essential; the most obvious example is the
SDI Battle Management System.

In  considering  an  array  processor  intended  for,  say,  ten  years  hence,  we 
can reasonably make certain assumptions:

•  Main  storage  will  become  very  inexpensive,  and  moderate  perfor¬ 
mance  processors  will  become  quite  inexpensive. 

•  The  major  costs  and  the  primary  physical  constraints  will  be  associ¬ 
ated  with  the  interconnection  interfaces;  the  performance  of  the  array 
will  be  determined  largely  by  communication  costs. 

•  High-density  packaging  and  interconnection  techniques  can  be  applied 
most  effectively  if  the  logical  structure  of  the  system  corresponds  to 
a  feasible  physical  structure. 

•  Although individual nodes in the array will be quite reliable, a large
array  must  necessarily  contain  faulty  nodes. 

2.3  Objectives 

For  some  applications,  a  very  close  match  between  the  structure  of  the 
application  and  the  structure  of  the  array  processor  is  not  only  possible, 
but  offers  some  advantages  from  the  standpoint  of  performance.  The  In¬ 
tersecting  Broadcast  Machine  is  not  intended  to  be  competitive  for  such 
applications,  however. 

But  there  are  important  applications  that  do  not  map  easily  onto  an 
array and for which the performance of array processors is poor. Our
objective is an architecture that performs well for arbitrary applications in
which there does not seem to be any preferable mapping onto a regular
structure. In the absence of a systematic mapping, the allocation of
activities among processors becomes essentially random; thus, the architecture
must perform well with such a random allocation and with the consequent
random communication.

To  ensure  reasonable  performance,  we  seek  a  connectivity  structure  in 
which  data  located  randomly  within  the  array  can  be  communicated  di¬ 
rectly  from  its  source  to  its  point  of  use  without  being  forwarded  through 
intermediate  nodes. 

To  ensure  feasible  construction  with  existing  high-density  packaging  as 
well  as  eventual  construction  on  the  surface  of  a  wafer,  we  seek  a  two- 
dimensional  structure. 

To  ensure  correct  operation  in  the  presence  of  faults,  both  solid  and 
transient,  we  seek  an  architecture  that  is  inherently  fault-tolerant. 


2.4  Structure  of  the  Intersecting  Broadcast 
Machine 

The  Intersecting  Broadcast  Machine  consists  of  two  orthogonal  sets  of 
buses.  Processors  are  located  at  the  intersections  between  buses,  each  pro¬ 
cessor  having  two  interfaces,  one  connecting  to  each  of  the  two  buses  at 
that intersection. Thus, for n buses in each set, there are n^2 processors. An
arbitration  mechanism  for  each  bus  allocates  that  bus  among  contending 
processors.  The  information  broadcast  on  a  bus  is  received  and  stored  by 
every  processor  connected  to  the  bus.  Processors  that  will  never  use  the  in¬ 
formation  must  still  store  it,  at  least  temporarily,  thus  entailing  additional 
storage  that  is  substantial  in  volume  but  modest  in  cost. 

Consider a processor, randomly located within the array, that has
computed a value, denoted in Fig. 2.1 as A. That processor broadcasts its result
on each of the two buses to which it is connected, and every processor along
both buses receives and stores that value. Another processor, also randomly
located, computes another value B, which is similarly broadcast over two
buses and stored by processors along those buses. There are now two
processors, at the intersections of the broadcasts, that have both values and
can continue the computation. It should be noted that there was no need
to plan or even to know in advance where the results would be computed;
the design thus lends itself to complex calculations for which such planning
would be difficult.

Figure 2.1: The Formation of Intersections. The broadcasting of two results, A
and B, from random locations in the array always yields at least two nodes at
the intersections of the broadcasts, where the next stage of the computation can
be executed.
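
The formation of intersections can be checked with a small sketch. It is an illustration only: the (row, column) coordinates, the array size n, and the function names are assumptions introduced here and do not appear in the report.

```python
def bus_members(node, n):
    """All nodes on the row bus and the column bus that pass through `node`."""
    row, col = node
    return {(row, j) for j in range(n)} | {(i, col) for i in range(n)}


def nodes_holding_both(src_a, src_b, n):
    """Nodes that have stored both values after each source has broadcast
    on its two buses (hypothetical helper, not part of the report)."""
    return bus_members(src_a, n) & bus_members(src_b, n)


if __name__ == "__main__":
    # Sources in different rows and columns: exactly two intersection nodes.
    print(sorted(nodes_holding_both((1, 2), (4, 6), 8)))
    # Sources that happen to share a row: every node on that row holds both.
    print(len(nodes_holding_both((1, 2), (1, 6), 8)))
```

When the two sources lie in different rows and columns, the two nodes found are (row of A, column of B) and (row of B, column of A), which is why no advance planning of locations is needed.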

It is anticipated that the array processor will normally be operated in
a  fault-tolerant  mode,  as  described  below.  However,  it  may  be  appropriate 
to  run  some  calculations  without  fault  tolerance.  We  describe  such  an 
operation  here. 

Since each computation is preferably done only once in the array, it is
necessary to select one of the two processors at the intersections to perform
the computation. This processor can be selected algorithmically, but here
we advocate a race. Each processor enqueues the operation along
with other operations it must perform and, when the operation has been
completed, the result f(A,B) is broadcast on the two buses to which the
processor is connected. As shown in Fig. 2.2, one of the processors will win
the race. Its broadcasts not only communicate the result, but also inform
the processors that had computed and broadcast the A and B values that
the result f(A,B) has been computed. These two processors then
generate auxiliary broadcasts on the orthogonal buses. Because the auxiliary
broadcasts carry the identity or designator of the result f(A,B) but not
its value, they can be much briefer. In the figure the auxiliary broadcasts
are denoted by broken lines. These broadcasts inform the fourth processor
that the computation has been completed and broadcast, and that there is
consequently no need for it to broadcast that result as well.

Figure 2.2: The Selection of a Processor by Means of a Race. One of the two
processors at the intersections wins the race and broadcasts its results (shown
solid). Auxiliary broadcasts (shown broken) inhibit the other processor from
broadcasting its results.
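
The geometry of the race can be traced in a short sketch. It assumes that the two sources lie in different rows and columns, and it models a bus as a ("row", i) or ("col", j) pair; these representations, the function names, and the winner_index parameter are illustrative assumptions, not part of the report.

```python
def buses_of(node):
    """The two buses (one row bus, one column bus) a node is attached to."""
    row, col = node
    return {("row", row), ("col", col)}


def race(src_a, src_b, winner_index=0):
    """Trace main and auxiliary broadcasts for the unreplicated case.

    winner_index stands in for the outcome of the race (0 or 1)."""
    (ra, ca), (rb, cb) = src_a, src_b
    candidates = [(ra, cb), (rb, ca)]          # the two intersection nodes
    winner = candidates[winner_index]
    loser = candidates[1 - winner_index]
    main = buses_of(winner)                    # main broadcast of f(A,B)
    aux = set()
    for src in (src_a, src_b):
        shared = buses_of(src) & main          # bus on which the source hears f(A,B)
        aux |= buses_of(src) - shared          # it re-announces on its orthogonal bus
    loser_hears_main = bool(buses_of(loser) & main)
    loser_hears_aux = bool(buses_of(loser) & aux)
    return winner, loser, aux, loser_hears_main, loser_hears_aux


if __name__ == "__main__":
    print(race((1, 2), (4, 6)))
```

Under these assumptions the losing processor never sees the main broadcast directly, which is why the auxiliary broadcasts from the two source processors are needed to inhibit it.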

In  the  event  the  auxiliary  broadcasts  do  not  reach  the  fourth  processor  in 
time  to  inhibit  its  broadcasts,  every  processor  on  the  four  buses  will  receive 
both  a  main  and  an  auxiliary  broadcast  for  the  value  f(A,B);  each  processor 
would  then  apply  an  algorithmic  selection  criterion.  It  is  possible,  if  not 
likely, that some of the processors may have already initiated subsequent
operations on the basis of a broadcast when the auxiliary broadcast is
received  that  inhibits  that  broadcast.  The  processor  can  readily  abandon 
internal  processing,  but  special  action  is  required  to  cancel  results  that 
have  themselves  already  been  broadcast.  The  technique  required,  known 
as  a  chase  protocol,  is  discussed  below  in  the  section  on  fault  tolerance. 

The  main  and  auxiliary  broadcasts  also  serve  other  functions: 

•  Along each bus there are many processors that have received and
stored one of the two values and are waiting for the second value,
which they will never receive. The receipt of the main or auxiliary
broadcast for f(A,B) indicates that such processors will not be
required to compute f(A,B), and therefore can be used to drive the
storage management algorithms.

•  If  the  calculation  involves  the  updating  of  a  value  and  there  are  several 
possible  updates,  the  race  is  not  just  between  the  two  processors 
at  the  intersections  but  also  between  concurrent  calculations.  The 
broadcasts  may  indicate  which  calculation  should  proceed  and  which 
should  be  restarted  with  the  new  updated  value.  In  more  complex 
cases,  the  race  should  be  to  claim  a  semaphore. 

Even  though  the  structure  described  in  this  section  is  not  intended 
to  be  fault-tolerant,  it  does  exhibit  some  measure  of  tolerance  for  faulty 
processors. If one of the two processors at the intersections has failed, the
other  is  available  to  perform  the  operation. 


2.5  Fault  Tolerance 

The  concept  is  readily  extended  to  provide  fault  tolerance  and,  indeed,  we 
do  not  believe  that  a  large-array  processor  can  be  operated  effectively  in 
any  other  mode.  To  provide  fault  tolerance,  each  value  is  computed  and 
broadcast by two processors in the array. Fig. 2.3 shows the values A and
B, each computed and broadcast by pairs of randomly located processors
in the array; the broadcasts for B are indicated by broken lines. Note that
there are eight nodes at which the values A and B are both available. It
would, of course, be inappropriate for the result f(A,B) to be computed and
broadcast by all eight; our objective is to have the result f(A,B) broadcast
by two nodes, just as the values A and B are.

Figure 2.3: The Eight Intersections Resulting from Replicated Broadcasts. When
each  value  is  computed  and  broadcast  by  two  processors  in  the  array,  there  are 
eight  nodes  at  the  intersections  of  the  broadcasts,  where  the  next  stage  of  the 
computation  can  be  performed. 

Here  too  it  is  possible  to  select  the  two  nodes  algorithmically  or  by 
means  of  a  race  condition.  For  the  latter  approach,  two  alternatives  are 
available,  depending  on  how  the  auxiliary  broadcasts  are  generated.  We  de¬ 
scribe  first  the  alternative  that  follows  more  closely  the  approach  described 
above  for  the  unreplicated  case. 

Fig. 2.4 shows that one of the processors has completed the next stage
of the computation, denoted by f(A,B), and has broadcast its value on the
two orthogonal buses. These broadcasts serve not only to communicate the
value, but also to inform two of the other processors in the set of eight
that the result has been computed and broadcast. As above, auxiliary
broadcasts are generated by the two processors that broadcast the A and B
values, shown in the figure by broken lines. The auxiliary broadcasts serve
to inhibit three more processors of the set of eight, leaving two processors
from that set.

Figure 2.4: The Selection of a Processor by Means of a Race. One of the eight
processors at the intersections wins the race and broadcasts its results (shown
solid). These broadcasts inhibit two of the other processors, and auxiliary
broadcasts (shown broken) inhibit yet another three processors from broadcasting
their results. Auxiliary broadcasts are generated by the processors that provide
the input values.

As  shown  in  Fig.  2.5,  one  of  the  two  remaining  processors  has  computed 
and  broadcast  the  value  f(A,B).  Here  again  two  auxiliary  broadcasts  have 
been generated by the processors that broadcast the A and B values.
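
The counts quoted for this first alternative (eight candidates, two inhibited by the main broadcasts, three by the auxiliary broadcasts, two left over) can be reproduced with a sketch. It assumes that all four source processors lie in distinct rows and columns; the coordinates and function names are hypothetical, and the winner is simply passed in rather than decided by a real race.

```python
def buses_of(node):
    """The row bus and column bus a node is attached to."""
    row, col = node
    return {("row", row), ("col", col)}


def candidates(a_srcs, b_srcs):
    """Intersection nodes that hold both A and B."""
    nodes = set()
    for ra, ca in a_srcs:
        for rb, cb in b_srcs:
            nodes.add((ra, cb))   # receives A on a row bus, B on a column bus
            nodes.add((rb, ca))   # receives B on a row bus, A on a column bus
    return nodes


def inhibition_round(active, sources, winner):
    """One race under the first alternative: the input sources that hear the
    main broadcast generate the auxiliary broadcasts."""
    main = buses_of(winner)
    by_main = {p for p in active if p != winner and buses_of(p) & main}
    aux = set()
    for src in sources:
        shared = buses_of(src) & main
        if shared:
            aux |= buses_of(src) - shared
    by_aux = {p for p in active
              if p != winner and p not in by_main and buses_of(p) & aux}
    remaining = active - {winner} - by_main - by_aux
    return by_main, by_aux, remaining


if __name__ == "__main__":
    a_srcs, b_srcs = [(0, 1), (2, 3)], [(4, 5), (6, 7)]
    active = candidates(a_srcs, b_srcs)
    print(len(active))
    print(inhibition_round(active, a_srcs + b_srcs, (0, 5)))
```

Running a second round on the two remaining candidates inhibits the last one through the auxiliary broadcasts alone, so exactly two processors broadcast f(A,B).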

Figure 2.5: The Selection of a Second Processor. One of the two remaining
processors wins the race and broadcasts its results. In this case too, auxiliary
broadcasts inhibit the other processor.

Fig. 2.6 shows an alternative approach to auxiliary broadcasts. Here the
set of eight intersection nodes divides into two groups of four each: a group
that receives A on a horizontal bus and B on a vertical bus, and a second
group that receives A and B in the other directions. We must select one
node from each group to continue the computation. The figure shows that
one processor has completed the computation of f(A,B) and has broadcast
its result value on the two buses. These broadcasts inform two of the group
of four processors that the result has already been computed and thus
inhibit them from also broadcasting the result. These two processors then
generate the auxiliary broadcasts. This differs from the foregoing approach,
in which the auxiliary broadcasts were generated by the processors that
had provided the A and B values. The auxiliary broadcasts indicate to
the fourth processor of the group that the result has been computed and
broadcast.

Figure 2.6: The Selection of a Processor by Means of a Race. One of the eight
processors at the intersections wins the race and broadcasts its results (shown
solid). These broadcasts inhibit two of the other processors. Here auxiliary
broadcasts (shown broken) are generated by the inhibited processors and inhibit
only one other processor.


Figure 2.7: The Selection of a Second Processor. The second group of four
processors  is  handled  similarly.  The  main  broadcasts,  done  by  the  processor 
winning  the  race,  inhibit  two  of  the  processors,  while  the  auxiliary  broadcasts 
inhibit  the  fourth. 

The  second  group  of  four  processors  is  handled  similarly,  as  shown  in 
Fig.  2.7. 
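
The grouping used by this second alternative can be sketched as follows. The sketch assumes sources in distinct rows and columns; the names group_h and group_v, the coordinates, and the explicit winner argument are illustrative assumptions rather than anything defined in the report.

```python
def buses_of(node):
    """The row bus and column bus a node is attached to."""
    row, col = node
    return {("row", row), ("col", col)}


def groups(a_srcs, b_srcs):
    """Split the intersection nodes into the two groups of four."""
    group_h = {(ra, cb) for ra, _ in a_srcs for _, cb in b_srcs}  # A on row, B on column
    group_v = {(rb, ca) for _, ca in a_srcs for rb, _ in b_srcs}  # B on row, A on column
    return group_h, group_v


def second_alternative_round(group, winner):
    """Main broadcasts inhibit two group members; those two generate the
    auxiliary broadcasts on their other buses, which reach the fourth."""
    main = buses_of(winner)
    by_main = {p for p in group if p != winner and buses_of(p) & main}
    aux = set()
    for p in by_main:
        aux |= buses_of(p) - main
    by_aux = {p for p in group
              if p != winner and p not in by_main and buses_of(p) & aux}
    return by_main, by_aux


if __name__ == "__main__":
    gh, gv = groups([(0, 1), (2, 3)], [(4, 5), (6, 7)])
    print(second_alternative_round(gh, (0, 5)))
    print(second_alternative_round(gv, (4, 1)))
```

In this model the auxiliary broadcasts of one group never touch a bus used by the other group, so each group independently selects one broadcaster.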

In  both  of  these  alternatives,  we  started  with  the  computation  and 
broadcasting of the values A and B by two processors in the array; now the
next computation is similarly performed and broadcast by two processors,
as shown in Fig. 2.8. Note that for each alternative, on each bus carrying a
broadcast value, there are three processors that hold independent versions
of that value:

•  The processor that computed and broadcast the value

•  Another processor that can compute the result but lost the race

•  A processor that cannot compute the result but received it on an
orthogonal bus.

Figure 2.8: Fault Tolerance. Initially the A and B values are each broadcast by
two processors in the array and, subsequently, f(A,B) is also broadcast by two
processors. On each bus there are three processors with independently computed
values, thereby allowing majority voting if necessary.

Consider the upper horizontal bus in Fig. 2.8. There is a processor that
has  computed  and  broadcast  the  value  f(A,B),  while  to  its  right  another 
processor  has  computed  the  result  but  was  inhibited  by  prior  broadcast  of 
that  result.  To  the  left  is  a  processor  that  has  not  computed  the  result  but 
received  the  value  f(A,B)  from  the  broadcast  by  the  selected  processor  of 
the  second  group  of  four.  Any  difference  between  the  broadcast  value  and 
these  alternative  values  results  in  the  latter’s  also  being  broadcast;  every 
processor  on  the  bus  then  has  three  independently  computed  values,  among 
which  it  can  choose  by  majority  voting. 
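
The voting step itself is simple; a minimal sketch follows. The function name and the convention of returning None when no strict majority exists are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(values):
    """Return the value held by a strict majority of the copies,
    or None if no value has a strict majority."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


if __name__ == "__main__":
    print(majority_vote([42, 42, 41]))   # three copies, one erroneous
    print(majority_vote([42, 41]))       # two differing copies: no majority
```

With three independent copies a single erroneous value is outvoted; with only two copies a disagreement can be detected but not resolved.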

Unfortunately,  this  majority  vote  occurs  after  the  first,  possibly  erro¬ 
neous  value  has  been  received  by  all  of  the  processors  along  the  bus.  If  the 
majority vote indicates that the first value is indeed erroneous and provides
another,  different  value,  it  is  possible  that  some  of  the  processors  may  have 
commenced  further  operations  based  on  the  erroneous  value  and  may  even 
have  broadcast  the  results  of  such  operations.  It  is  clearly  essential  that 
these  erroneous  results  be  retracted  as  rapidly  as  possible.  This  problem  is 
not  unique  to  the  Intersecting  Broadcast  Machine,  but  arises  in  almost  all 
fault-tolerant  distributed  systems.  It  has  been  investigated  by  Randell  and 
Merlin [1], who refer to it as the chase protocol problem, and by Liskov et
al. [2], who call it orphan detection. The chase protocol is so named because
the  retraction  message  must  chase  after  erroneous  results,  possibly  through 
several  intermediate  processors.  Of  interest  with  regard  to  these  algorithms 
is  the  question  of  convergence,  namely,  whether  the  chase  after  erroneous 
values  ever  actually  terminates.  Convergence  depends  on  the  ratio  of  the 
time  for  computation  and  transmission  of  new  result  values  to  the  time 
it  takes  to  propagate  the  chase  messages.  Provision  may  be  made  to  give 
chase messages priority over other broadcasts, so as to improve convergence
of  the  protocol. 

The first of the alternatives, in which the auxiliary broadcasts are
generated by the processors that supply the input values, matches closely the
approach required for unreplicated operation. However, the second
alternative, in which the auxiliary broadcasts are generated by processors at
intersections, has better fault tolerance properties. In particular, under
the second alternative, if one of the values is broadcast by only one
node, while the other is broadcast by two, the result of the operation is
still broadcast by two nodes. Of course, in that case not all three
independently computed values are available on each bus for majority voting,
if so required. Some buses may have only two values available, allowing error
detection but not correction. Consequently, the second alternative permits
recovery from a situation in which, for whatever reason, a value is broadcast
by only one node of the network. The first alternative does not possess this
property.
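
The difference between the two alternatives can be illustrated by counting how many processors end up broadcasting the result. The sketch below is a simplified model under stated assumptions: sources occupy distinct rows and columns, every broadcast arrives before the next race is decided, and min() merely stands in for whichever processor wins the race. All names are hypothetical.

```python
def buses_of(node):
    """The row bus and column bus a node is attached to."""
    row, col = node
    return {("row", row), ("col", col)}


def candidates(a_srcs, b_srcs):
    """Intersection nodes that hold both A and B."""
    nodes = {(ra, cb) for ra, _ in a_srcs for _, cb in b_srcs}
    nodes |= {(rb, ca) for _, ca in a_srcs for rb, _ in b_srcs}
    return nodes


def broadcasters(a_srcs, b_srcs, aux_from_sources):
    """Processors that broadcast f(A,B).

    aux_from_sources=True models the first alternative (auxiliary broadcasts
    from the input sources); False models the second (auxiliary broadcasts
    from the candidates inhibited by the main broadcast)."""
    active = candidates(a_srcs, b_srcs)
    winners = []
    while active:
        winner = min(active)                     # stand-in for the race outcome
        winners.append(winner)
        main = buses_of(winner)
        if aux_from_sources:
            relays = list(a_srcs) + list(b_srcs)
        else:
            relays = [p for p in active if p != winner]
        heard = set(main)
        for node in relays:
            if buses_of(node) & main:            # node hears the main broadcast
                heard |= buses_of(node)          # and relays on its other bus
        active = {p for p in active
                  if p != winner and not (buses_of(p) & heard)}
    return winners


if __name__ == "__main__":
    full_a, lone_a = [(0, 1), (2, 3)], [(0, 1)]
    b = [(4, 5), (6, 7)]
    for a in (full_a, lone_a):
        print(len(broadcasters(a, b, True)), len(broadcasters(a, b, False)))
```

With both inputs replicated, each alternative yields two broadcasters. When A is broadcast by a single node, the first alternative yields only one broadcaster, whereas the second still yields two, which matches the recovery property described above.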

We  must  consider  not  only  the  possibility  of  processor  failure,  but  also 
bus failure. First, we note that solid failure of a bus prevents all of the nodes
along it from receiving any further broadcasts on that bus. Consequently,
that entire row of processors can no longer be at an intersection and
is essentially lost to the system. Thus, in view of the very
detrimental  effect  on  system  performance,  the  design  should  attempt  to 
minimize  the  rate  of  bus  failure. 




Figure 2.9: The Effect of Bus Failure. Some broadcasts do not take place, and
some  of  the  processors  may  not  be  inhibited.  There  may  be  three  broadcasts  of 
a  value  in  one  direction,  but  only  a  single  broadcast  in  the  other  direction. 

Immediately  after  a  bus  failure,  the  processors  along  the  bus  may  still 
be  executing  operations  based  on  values  received  prior  to  failure  of  the  bus. 
The  effect  of  bus  failure  on  such  operations  is  shown  in  Fig.  2.9.  If  one  of 
the  processors  connected  to  the  faulty  bus  wins  the  race  in  its  group  of  four, 
it  will  be  able  to  broadcast  its  result  on  the  working  bus  but  not  on  the 
one  that  has  failed.  Consequently,  the  other  processor  of  the  group  of  four 
that  is  attached  to  the  faulty  bus  will  not  receive  the  broadcast  and  will 
not  know  that  it  should  refrain  from  broadcasting  its  result  value.  Thus, 
as  is  shown  in  the  figure,  the  value  may  be  broadcast  on  three  buses  in  one 
direction, but on just one bus in the other.

Fig. 2.10 shows the possible next computation, with the values renamed
for  convenience.  The  value  C  is  broadcast  on  three  vertical  buses  and  one 
horizontal  bus,  while  D  is  broadcast  on  two  buses  in  each  direction. 

Figure 2.10: Continued Computation after Bus Failure. To examine the effects
of continued computation after bus failure, we consider a computation in which
one of the arguments has the three/one broadcast pattern.

Figure 2.11: Recovery from Bus Failure. Starting with one argument in the
three/one pattern, the next result is broadcast by two processors and carried on
two buses in each direction.

Fig. 2.11 shows the result of the next computation. The value f(C,D) is
broadcast  on  two  horizontal  and  two  vertical  buses,  as  desired.  On  three 
of  these  buses,  at  least  three  independently  computed  values  are  available 
for  majority  voting  if  necessary.  But  there  may  be  one  bus  on  which  only 
two values are available. Thus, during recovery from bus failure, some
buses may temporarily provide error detection but not error
correction.


2.6  Bus  Structure 

The  architecture  is  built  largely  around  the  bus  structure.  The  buses  might 
be quite similar in design to those employed by such companies as ELXSI,
Sequent,  Encore  and  Alliant.  Such  buses,  which  can  accommodate  up  to 
30  interfaces,  currently  run  at  about  40  MHz  with  32  or  64  data  signals. 
100  MHz  or  more  may  soon  be  possible,  and  optical  versions  of  such  a  bus 
should  be  able  to  operate  at  even  higher  rates.  Transfers  across  the  bus 
must carry both an identity for the value and the value itself, which might
be  as  small  as  a  single  word  or  could  be  quite  large. 

Since  the  buses  are  of  course  contention  buses,  arbitration  circuitry 
is required to allocate their use to the processors. This circuitry can be
separate from the bus itself and need not impair performance.

The  regular  two-dimensional  bus  structure  has  physical  characteristics 
that are suitable for mass production and permit the design of
very-high-performance buses. In particular, a two-dimensional structure is very
appropriate, perhaps even essential, if the array processor is to be
implemented directly across a single wafer, as may be possible in the future.


2.7 Performance Model


Since  all  the  processors  in  a  row  or  column  must  share  the  same  bus,  there 
is  contention  for  use  of  the  buses;  access  to  them  therefore  can  become  a 
limiting  factor  in  the  design.  As  the  array  is  made  larger,  with  constant 
processor  and  bus  performance,  there  will  necessarily  come  a  point  at  which 
the  buses  become  overloaded.  When  this  occurs,  further  increases  in  array 
size  will  add  little  to  overall  performance.  Hence,  the  design  scales  well 
only  up  to  the  point  at  which  the  buses  become  saturated. 

The number of processors that can share a bus without saturating it
depends on (1) the ratio of the performance of the processors to the
performance of the bus, (2) the length of the typical computation, and (3) the size
of the typical result to be transmitted across the bus. If the typical
computation is very short, perhaps a single operation as in a dataflow machine,
it  can  be  expected  that  the  time  to  perform  an  operation  will  be  less  than 
or  equal  to  the  time  to  transmit  the  result  (probably  a  single  word).  With 
many  processors  competing  for  the  bus,  saturation  is  clearly  inevitable. 

But a coarser granularity of computation should increase the duration
of the computation by more than it increases the size of the results,
particularly if the computations and data structures are carefully chosen. Current
designs  for  fast  buses,  such  as  will  be  necessary  here,  can  accommodate  up 
to  30  interfaces  on  the  bus.  Consequently,  a  good  objective  would  be  to 
find  programs  in  which  the  typical  program  fragment  takes  about  30  times 
as  long  to  execute  as  the  bus  needs  to  transmit  the  result. 

A  simple  queueing  theory  model  has  been  constructed  to  determine  the 
effect  of  the  ratio  of  processing  time  to  bus  transfer  time.  The  model,  as 
shown  in  Fig.  2.12,  considers  one  column  (or  row)  of  processors  together 
with the bus that serves that column.

Figure 2.12: The Queueing Theory Performance Model. The model considers
one column of n processors and the bus that serves that column.


Let

    n be the number of processors in a row or column,
    s the mean time for a processor to process an operation
      (excluding time in queues waiting for operands or the processor),
    e the mean utilization of a processor,
    m the ratio of bus speed to processor speed,

giving s/m as the mean time to transmit a result over the bus.


Since the utilization of a processor is e, by elementary queueing theory,
the mean number of operations being performed or waiting in the queue to
be performed by that processor is

$$\frac{e}{1 - e}$$

Since n processors make requests on the bus and since the bus is m times
as fast as a processor, the utilization of the bus is ne/m. Consequently,
the mean number of operations being broadcast or queued, waiting to be
broadcast, is

$$\frac{ne/m}{1 - ne/m} = \frac{ne}{m - ne}$$

There are n^2 processors in the system and n buses (we do not count the
orthogonal set of buses, since each broadcast must use one bus from each
set). Thus the number of concurrent operations needed to achieve the
processor utilization of e is

$$\frac{n^2 e}{1 - e} + \frac{n^2 e}{m - ne}$$

Fig.  2.13  shows  the  results  of  this  queueing  model  for  a  system  with 
n  =  30.  When  m  =  30,  high  utilization  of  the  processors  requires  many 
more  concurrent  operations  than  processors,  but  acceptable  utilization  is 
obtained  when  the  number  of  concurrent  operations  is  equal  to  the  number 
of  processors.  For  a  slightly  slower  bus,  with  m  =  20,  there  is  a  loss  of 
processor  utilization  resulting  from  bus  saturation  when  there  are  a  great 
many  concurrent  operations.  But,  if  the  number  of  concurrent  operations  is 
comparable  to  the  number  of  processors,  the  loss  of  processor  utilization  is 
not substantial. Significantly slower buses, as when m = 10, become
saturated before adequate processor utilization is achieved. There is no benefit
from  using  much  faster  buses;  the  curve  for  m  =  50  is  indistinguishable 
from  that  for  m  =  30. 
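
The saturation behavior described for Fig. 2.13 can be explored numerically. The following is a minimal sketch in Python, assuming the standard single-server result that a server with utilization rho holds rho/(1 - rho) operations on average, applied to both processors and buses; the function names are illustrative and do not come from this report.

```python
def mean_in_system(rho):
    """Mean number of operations in service or queued at a single server
    with utilization rho (standard single-server result, assumed here)."""
    if not 0 <= rho < 1:
        raise ValueError("utilization must satisfy 0 <= rho < 1")
    return rho / (1 - rho)


def bus_utilization(n, e, m):
    """Utilization of one bus shared by n processors, each with
    utilization e, when the bus is m times as fast as a processor."""
    return n * e / m


def max_processor_utilization(n, m):
    """Upper bound on e imposed by the bus: n*e/m < 1 means e < m/n,
    and e can never exceed 1."""
    return min(1.0, m / n)


def concurrent_operations(n, e, m):
    """Concurrent operations in an n-by-n array: n*n processor queues
    plus n bus queues (the orthogonal set of buses is not counted)."""
    return n * n * mean_in_system(e) + n * mean_in_system(bus_utilization(n, e, m))


if __name__ == "__main__":
    n = 30
    for m in (10, 20, 30, 50):
        print(m, round(max_processor_utilization(n, m), 3))
```

Under these assumptions, with n = 30 the bus limits processor utilization to 1/3 for m = 10 and to 2/3 for m = 20, while for m = 30 and m = 50 the bus is not the binding constraint. For n = 30 and m = 30, a utilization of e = 0.5 corresponds to 930 concurrent operations, roughly one per processor.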


2.8  Load  Balancing 

The queueing theory model above makes an assumption that the load is
spread across the array uniformly and randomly. But how can we be sure,
even if we start with a uniform and random distribution, that execution of
the program will not tend to cluster much of the load onto a few processors,
leaving others underutilized? We consider a simple stochastic model of the
system: 

Let  $p_i$ be the proportion of the load on the ith horizontal bus,
     $q_j$ the proportion of the load on the jth vertical bus,
     $r_{ij}$ the proportion of the load on processor i, j.

Figure 2.13: Processor Utilization. Processor utilization depends on the number
of concurrent operations as well as on the ratio m of bus to processor speed. The
figure is drawn for a system of 900 processors and two sets of 30 buses.

We can relate these by

(1)  $\sum_i p_i = \sum_j q_j = \sum_{i,j} r_{ij} = 1$

(2)  $p_i = \sum_j r_{ij}$

(3)  $q_j = \sum_i r_{ij}$

(4)  $r_{ij} = k\,p_i\,q_j$

The first of these equations is clearly valid, while the second and third
depend on the assumption that the results generated by different processors
are drawn from the same distribution of sizes, which is not an unreasonable
premise. Equation (4) is open to some doubt, for the program does not
select  operands  for  processing  entirely  at  random  and  thus  can  distort  the 
distribution.  However,  even  if  we  assume  the  validity  of  Equation  (4),  it 
is  evident  that  these  equations  have  no  unique  solution.  Thus  there  can 
be  no  expectation  that  the  load  will  remain  uniformly  distributed,  nor,  in 
particular,  any  expectation  that  the  load  will  be  self-stabilizing. 

But  all  is  not  yet  lost.  Consider  a  system  in  which  the  probability  of 
a  processor’s  performing  an  operation  depends  on  that  processor’s  current 
load. 

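(1)  $\sum_i p_i = \sum_j q_j = \sum_{i,j} r_{ij} = 1$

(2)  $p_i = \sum_j r_{ij}$

(3)  $q_j = \sum_i r_{ij}$

(4)  $r_{ij} = k\,p_i\,q_j \times r_{ij}^{-\delta}$ or $r_{ij} = 0$, where $\delta > 0$ ($\delta = 0$ is the previous case)

or (4')  $r_{ij} = (k\,p_i\,q_j)^{\frac{1}{1+\delta}}$.

The revised version of (4) contains a factor $r_{ij}^{-\delta}$ that reduces the
probability that a heavily loaded processor will undertake an operation and,
conversely, increases the probability that a lightly loaded processor will do
so.

Now  $p_i = \sum_j r_{ij} = (k\,p_i)^{\frac{1}{1+\delta}} \sum_j q_j^{\frac{1}{1+\delta}}$

and  $p_i^{\frac{\delta}{1+\delta}} = k^{\frac{1}{1+\delta}} \sum_j q_j^{\frac{1}{1+\delta}}$  if $p_i \neq 0$,

showing that $p_i$ is independent of i.

Thus  $p_i = q_j = \frac{1}{n}$,  $r_{ij} = \frac{1}{n^2}$,  and  $k = n^{-2\delta}$.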


We do not expect that, in practice, the actual form of the term describing
the probability of a processor’s performing an operation will be precisely
as presented here; this form was chosen to facilitate analysis, with the
objective being merely to show that negative feedback can indeed stabilize
the load. The race algorithms described above provide this feedback.
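
The stabilizing effect of the negative-feedback term can be illustrated numerically. The sketch below is an assumption-laden illustration, not part of the analysis in this report: it repeatedly applies a rule of the form (4'), with the load on processor i, j set proportional to the product of its row and column loads raised to the power 1/(1 + delta), and renormalizes so that the total load is 1 (the normalization stands in for the constant k). The function names and the iteration itself are hypothetical.

```python
def marginals(r):
    """Row loads p_i and column loads q_j of a load matrix r."""
    p = [sum(row) for row in r]
    q = [sum(col) for col in zip(*r)]
    return p, q


def rebalance(r, delta, steps):
    """Repeatedly set r_ij proportional to (p_i * q_j) ** (1 / (1 + delta)),
    renormalizing after each step so that the total load stays 1."""
    gamma = 1.0 / (1.0 + delta)
    for _ in range(steps):
        p, q = marginals(r)
        r = [[(pi * qj) ** gamma for qj in q] for pi in p]
        total = sum(sum(row) for row in r)
        r = [[x / total for x in row] for row in r]
    return r


if __name__ == "__main__":
    start = [[0.7, 0.1], [0.1, 0.1]]
    print(rebalance(start, delta=1.0, steps=60))  # feedback present
    print(rebalance(start, delta=0.0, steps=60))  # no feedback
```

With delta > 0 the iteration drives every entry toward the uniform value 1/n^2, whereas with delta = 0 the load settles on the product of the initial row and column loads and stays uneven, which matches the observation that the equations without feedback have no unique solution.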

Not only should the algorithms of the system spread the load uniformly,
but they should also be stable. Unfortunately, since a more detailed analysis
involves  both  queueing  theory  and  control  theory,  it  is  quite  difficult.  For 
the  present,  we  note  that  system  software  usually  appears  to  be  heavily 
damped  and  seldom  becomes  unstable. 

One of the most significant differences between the Intersecting Broadcast
Machine and other array processor architectures is now apparent.
Other architectures require a particular geometric mapping of the
application onto the array in order to obtain minimal communication and high
performance. The Intersecting Broadcast Machine operates on the basis of
a random allocation of the application to the processors of the array, and
performs acceptably for that random allocation. There are many applications
for which the particular mapping is hard to find or does not exist,
and so the resulting performance on such architectures may be very bad.
But all applications can be allocated at random and should perform
acceptably on the Intersecting Broadcast Machine. Furthermore, a particular
mapping may be seriously disrupted by faulty processors, while a random
allocation should not be affected significantly. Consequently, the
Intersecting Broadcast Machine should be more general and more robust than
other array processor architectures.


2.9  Programming 

In  some  ways  the  architecture  resembles  a  data-driven  dataflow  machine. 
Most of what has been learned about dataflow architectures is applicable,
especially  as  regards  the  naming  of  values.  However,  there  are  significant 
differences.  In  a  dataflow  machine,  the  selection  of  data  to  be  processed  is 
determined  primarily  by  the  dataflow  program;  it  is  largely  independent  of 
the  values  of  the  data.  The  Intersecting  Broadcast  Machine,  in  contrast, 
is  perhaps  most  effective  when  the  selection  of  data  items  to  be  processed 
together  depends  on  the  values  of  the  data;  for  this  reason,  the  allocation 
of  data  to  processors  cannot  be  preplanned. 

Dataflow architectures are usually purely functional, whereas it is
possible to operate the Intersecting Broadcast Machine as an imperative
machine, while taking precautions, of course, to avoid unintended
interactions between concurrent operations.

As in many other dataflow structures, operations are necessarily monadic
or dyadic (one or two inputs). However, it is possible to show that, if the n
values  are  randomly  distributed  across  the  array,  the  number  of  broadcasts 
necessary  to  collect  all  n  values  at  one  node  is  not  increased  by  gathering 
them  in  pairs. 

Because our objective is to reduce communication costs, it would appear
that a relatively coarse granularity of computation will be appropriate.
Research is currently in progress at the University of Illinois and elsewhere
on automatic decomposition of conventional programs into fragments of an
appropriate granularity that can be executed in parallel. This research is
essential to the effective operation of the University of Illinois Cedar
supercomputer; it promises to be equally effective for the Intersecting
Broadcast Machine. As yet, no substantive results have been reported.


2.10  Applications 

There are many applications for which the Intersecting Broadcast Machine
is no better than other array processors. Where the geometric structure of
the application matches closely the structure of the array processor,
communication among processors can be efficient and the overall performance
of the array processor can be very good. Typical applications of this type
are image and signal processing and the solving of partial differential
equations. But even programs whose inner loops perform regular calculations
suitable for array processing may have substantial sections of initialization
and analysis that are not so regularly structured and not so efficiently
processed. In some cases, the inefficiency of processing these unstructured
portions of a program is such that they dominate overall processing time.

The Intersecting Broadcast Machine should be capable of running a
wider range of programs than some other array processors, since it is not
dependent on finding a good mapping of an application onto the array.
Many programs have a very complex structure that is often not well
understood; this is especially true of command and control programs and AI
programs.

An example of an application that appears to be ideal for the Intersecting
Broadcast Machine is that of discrete Monte Carlo particle dynamics.
This application is very important to Lawrence Livermore Laboratory
because it allows the modeling of very violent events that are not well modeled
by the fluid dynamic approximations used for events that are less violent.
Discrete particle dynamic simulations track each particle individually and
model the interactions among particles. If the event is sufficiently violent,
the spatial relationships among particles can differ substantially between
time steps, so that it is continuously necessary to reascertain which particles
are close enough to one another to interact. The nature of the interaction
calculations depends on the separation, velocities, and types of the
interacting particles and, in addition, may involve a number of different code
sequences; each particle may interact with only a few or with many other
particles, depending on its situation. It has been found difficult to vectorize
this calculation, and the differences in the calculations for each individual
particle make an SIMD approach less effective. The huge amount of
calculation required, however, clearly indicates the need for an array processor.
From a superficial standpoint, the Intersecting Broadcast Machine appears
to be quite suitable.

Rather similar calculations that might also be very appropriate include
conflict prediction in air traffic control and ray tracing for
three-dimensional computer graphics.

Another application for which the Intersecting Broadcast Machine
appears to be quite suitable is SDI Battle Management. Here again the
matching of information from diverse sources and the dynamic allocation of battle
resources on the basis of complex optimization criteria may well be beyond
the computational abilities of sequential processors. Furthermore, it is not
easy to implement these procedures on conventional array processors. The
battle management software is likely to be changed and elaborated rather
more frequently than is necessary for other types of applications. Such
modifications may be difficult to make if the design of the software has to
be tied to a specific mapping of the application onto the array processor;
an architecture that also allows the data structures to be readily modified
is more suitable. The battle management application is one for which fault
tolerance is clearly imperative.


2.11 Conclusions


The Intersecting Broadcast Machine array processor architecture is
currently only an interesting idea that shows promise of being

• Effective for arbitrary programs that cannot be mapped onto regular
array structures and that, consequently, perform poorly on existing
array processors

• Capable of operation in a fault-tolerant mode

• Physically structured to permit high-performance VHSIC
implementation.

There is still much to be done before we can be confident that the
architecture will indeed perform as envisaged. It is necessary to investigate,
in particular,

• More details of the design, including

  - Broadcast protocols
  - Naming of result values
  - Representation of programs
  - Recognition of intersections at which operations must be performed

•  Methods  of  programming  the  architecture 

• Studies of sample applications.


Part III

Broadcast  Protocols  for 
Distributed  Systems 


3.1  Abstract 


This section of the report proposes a novel reliable broadcast protocol for
the link level of the protocol hierarchy. The protocol exploits the broadcast
nature of the physical communication media typically used in local area
networks. A combination of positive and negative acknowledgment
strategies allows reliable operation without requiring a separate acknowledgment
from every recipient of a message. This work was undertaken in
collaboration with Professor L. E. Moser of California State University, Hayward.


3.2  Introduction 

Many distributed computer systems use a communication mechanism that
is physically a broadcast medium, such as the Ethernet or a packet radio
system. Other common communication media, such as the token ring,
could function as broadcast media, even though they are not normally
so used. The advantage of a broadcast communication medium is that
it makes it physically possible to distribute a message simultaneously to
several destinations.

There are important activities in a distributed computer system that
involve many processors simultaneously and that would benefit from
broadcast communication. Among these are scheduling and load balancing,
synchronization, access to distributed information, update and commit for
distributed databases, and transaction logging.

Existing communication protocols do not allow distributed computer
systems to make use of this broadcast capability, but rather require all
messages to be point-to-point, from a single source to a single destination.
If the nature of the application is such that broadcast communication is
appropriate, existing systems must send many individual messages and receive
corresponding individual acknowledgments. In a network of N nodes, this
results in a total of 2N messages, when perhaps a single broadcast message
might have sufficed. The high cost of broadcast communication is not only
wasteful of the communication resource, but it also limits the size of the
distributed system by saturating the communication system and
discourages the use of truly distributed algorithms because of their unnecessarily
high communication cost.

Reliable transmission of a message requires the ability to retransmit
the message if it is damaged or lost in transit. Within the ISO protocol
hierarchy, the primary responsibility for ensuring this reliable transmission
across the broadcast communication medium lies with the link-level
communication protocol. The protocol proposed here is directed towards that
level of the hierarchy. Consequently, the protocol provides only services
appropriate to the link level, in contrast to other atomic broadcast protocols
that ignore the hierarchy and are designed to be entirely self-contained. For
example, our protocol can determine whether a node has acknowledged receiving
a message, but has no responsibility for network membership or network
reconfiguration following a failure. Some of what we describe below may
also be relevant to other levels in the protocol hierarchy, particularly the
transport level that ensures reliable transmission between hosts.

Most existing link-level protocols use positive acknowledgments, in which
the recipient of a message explicitly transmits an acknowledgment of its
receipt, either as a separate message or as part of another message. The
sender of the original message uses a timeout to trigger retransmission if no
acknowledgment is received from the recipient. In a broadcast context, such
protocols require individual acknowledgments from each recipient, even if
it is possible (which it usually is not) to take advantage of the
broadcast medium to disseminate the initial message to all recipients. Thus,
broadcasting with positive acknowledgments could reduce the number of
messages from 2N to N+1, which is still far from taking full advantage of
the broadcast medium.
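
The message counts quoted above can be written out directly. This is a small sketch that takes the counts as stated in the text (2N for point-to-point delivery, N+1 for a broadcast with individual positive acknowledgments); the function names are illustrative assumptions.

```python
def point_to_point_messages(n):
    """N individual messages plus N individual acknowledgments."""
    return 2 * n


def broadcast_with_positive_acks(n):
    """One broadcast message plus N individual acknowledgments."""
    return n + 1


def messages_saved(n):
    """Reduction obtained by broadcasting the initial message once."""
    return point_to_point_messages(n) - broadcast_with_positive_acks(n)


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, point_to_point_messages(n), broadcast_with_positive_acks(n))
```

Even with the initial message broadcast once, the count still grows linearly with N because every recipient acknowledges separately.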

To  eliminate  this  overhead,  we  must  use  a  negative-acknowledgment 
strategy,  in  which  most  nodes  transmit  no  acknowledgment  if  they  receive 
a  message  successfully,  but  rather  transmit  a  negative  acknowledgment  if 
they  become  aware  that  they  have  not  received  a  message. 

We should also note that realistic systems will contain many
semi-independent processes within each node. The overall communication
system may need to deliver the broadcast message to several such processes,
but such delivery is not the responsibility of the ISO link-level protocol.
We do not consider further multiple delivery within a node.

The  negative-acknowledgment  broadcast  protocol  described  here  does 
involve  costs,  particularly  computation  costs  that  might  in  many  cases  be 
borne  by  an  interface  microprocessor.  There  are  also  delay  costs  that  must 
be  compared  with  the  delays  caused  by  the  heavier  communication  load 
of  existing  protocols  and  algorithms.  The  utility  of  such  protocols  also 
depends  on  some  assumptions: 

• The performance of individual processors and the demand for
communication generated by such powerful local computation will outstrip
the available bandwidth of the communication medium.

• Many applications will require distributed computation and
consistent distributed data spread across a local distributed system.

•  Requirements  for  consistency  with  remote  sites  (beyond  the  broadcast 
communication  medium)  will  be  minimized. 

There is a possibility that continued progress in communication
technology, such as 100 MHz fiber links, will eliminate any communication
bottlenecks and eliminate the need for more efficient broadcast protocols. But
techniques applicable to communications are also effective in enhancing the
performance of processing nodes. It is possible, even likely, that advances
in semiconductor technology will allow much greater increases in processor
performance, in that an entire processor can be contained in a small
controlled package, while interprocessor communication will be subject to gross
physical constraints. Consequently, we anticipate that the communication
medium will continue to be a limiting resource in distributed systems and
that broadcast protocols will become an important technique for distributed
systems.

Given a reliable and efficient broadcast protocol, it then would be
possible to take advantage of it to construct efficient distributed application
algorithms.  We  have  started  to  investigate  such  algorithms  for  distributed 
mutual  exclusion,  locking,  and  synchronization,  as  well  as  for  update  and 
commit  in  a  distributed  database. 


3.3  Existing  Protocols 

The most detailed existing description of a reliable broadcast protocol is
that by Chang and Maxemchuk[3]. Their protocol requires that all
messages pass through an intermediary node called the token site. A node
wishing to broadcast a message must communicate it to the token site,
using a positive-acknowledgment protocol. Using a negative-acknowledgment
protocol, the token site then broadcasts the message to all recipients; any
missing messages are detected by gaps in the sequence. The use of a single
common intermediary makes the negative-acknowledgment technique more
effective. A complex token-passing protocol is used to detect failures at
the token site, to select a new token site, and to retransmit messages
affected by the failure. Although only two messages and one acknowledgment
are required for every message broadcast in the absence of errors, the
token-passing protocol can, in fact, add significantly to the number of
messages if transmission errors are frequent.

Schneider has described a reliable broadcast protocol capable of
operating on partially connected networks[4]. His protocol can operate in a
more complex network structure with gateways, but does not have high
efficiency on a local network. It might appropriately be implemented using
the protocol described here at the local level.

Several authors[5,6] have described broadcast protocols in which each
message is followed by an empty pause or a null message for a token ring.
A node that detects the presence of the message, but is unable to receive
it uncorrupted, transmits a negative acknowledgment in the pause or null
message. Such algorithms are effective against reception faults, but not
against transmission faults, momentary network-partitioning faults, or
processor fail-stop faults.

A further class of broadcast protocols, the asynchronous atomic broadcast
protocols[7], is more concerned with maintaining completely global
consistency of message ordering and delivery in the presence of node failures (the
Chang and Maxemchuk protocol also provides asynchronous atomic
broadcasts). Such protocols necessarily involve maintenance of a configuration of
currently operating nodes and mechanisms for reconfiguration in the event
of node failure. These features are not appropriate to the link level of the
hierarchy, but are more appropriate to the network, transport, and higher
levels of the hierarchy. They can be built on top of the reliable link-level
broadcast provided by our protocol.


3.4 Requirements and Objectives


Our objective is to provide a reliable link-level or transport-level protocol.
Messages should be capable of being broadcast simultaneously to many
destinations, without the need for explicit acknowledgment by every recipient.
The originator of the message should be assured that all working
destinations have received the message, or that one or more destinations did not
receive the message and that it should therefore be retransmitted. It should
also be possible to confirm that certain specified destinations were working
and did receive the message.

The  protocol  must  also  be  able  to  ensure  that  messages  from  one  source 
will  be  delivered  in  the  order  in  which  they  were  originated  by  that  source. 
Since  some  messages  may  have  to  be  retransmitted  to  compensate  for  errors, 
this  may  require  the  use  of  sequence  numbers  to  reorder  the  messages  after 
reception.  There  is  no  requirement  that  messages  from  different  sources  be 
received  in  any  particular  order. 
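
The per-source ordering requirement can be sketched as a small reordering buffer. This is a minimal illustration under stated assumptions, not the mechanism specified by the protocol: one buffer is kept per source, sequence numbers start at a known value and increase by one, and wraparound of sequence numbers is ignored. The class and method names are hypothetical.

```python
class SourceReorderBuffer:
    """Delivers the messages of one source in sequence-number order."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq  # next sequence number to deliver
        self.pending = {}          # out-of-order messages, keyed by seq

    def receive(self, seq, payload):
        """Accept a message and return the payloads that can now be
        delivered in order (possibly an empty list)."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate, for example a retransmission
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


if __name__ == "__main__":
    buf = SourceReorderBuffer()
    print(buf.receive(0, "a"))
    print(buf.receive(2, "c"))  # held back until message 1 arrives
    print(buf.receive(1, "b"))
```

Because messages from different sources need not be ordered relative to one another, a receiver following this sketch would keep an independent buffer for each source.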

Reliable communication that depends on backward error recovery and
retransmission necessarily incurs a delay before the originator of a message
can be certain that all the intended recipients of the message have indeed
received it. In a positive-acknowledgment system, that delay is represented
by the time until the last acknowledgment is received. In a
negative-acknowledgment system, the situation is more complicated. Some kinds of
messages are such that it is the time to the first response that is important
(e.g., “Give me the value of X”). For other types of messages (e.g., “Update
the current value of X”), the delay may be the time until the originator can
be certain that every working node has received the message. This time
is much less certain in a negative-acknowledgment system, but may be
an important performance parameter in some contexts. The performance
measures for the protocol must therefore be

• The load placed on the communication medium

• The load that causes the medium to become saturated

• The delay incurred until the originator can be certain of delivery.


Generally, the load imposed in the absence of errors is more important
than the additional load induced when errors occur, since errors are not
very frequent. Similarly, the delay until confirmation of delivery is usually
more important for delivery to a working node than for delivery to a failed
node. Deduction that a node has
failed  may  be  based  in  part  on  information  provided  by  this  level  of  the 
hierarchy  to  the  effect  that  no  response  has  been  obtained  from  the  node; 
the  decision,  however,  lies  above  the  link  level. 

The  protocol  must  operate  reliably  in  a  network  subject  to  a  variety  of 
faults.  Among  these,  in  particular,  are  the  following: 

•  Transmission  faults,  in  which  the  transmitted  message  is  either  not 
received  by  any  destination  or  is  received  in  a  garbled  condition  by 
all  destinations.  We  assume  that  transmission  faults  are  relatively 
infrequent. 

•  Reception  faults,  in  which  one  or  more  destinations  do  not  receive 
the  message  or  receive  it  garbled,  while  other  destinations  receive 
it  correctly.  Again  we  assume  that  reception  faults  are  relatively 
infrequent,  say,  substantially  fewer  than  one  error  per  N  messages  in 
an  N  node  system. 

• Network-partitioning faults, in which the network is divided by the
fault into two or more subnets, with communication remaining
possible within each subnet but not among subnets.

• Node fail-stop faults, in which a node ceases operation. We assume
that a failed node rejoining the network is aware that it has failed,
since transitional acknowledgment rules must be applied.

We assume that a node can apply adequate checks to a message it has
received to ensure that it has been received uncorrupted. We exclude faults
involving babbling nodes and faults resulting in the total inability of all
nodes to use the communication medium, since there is no way of ensuring
recovery from such faults with a single communication medium. Malicious
nodes are excluded. We also exclude faults that result in one or more
pairs of nodes being systematically unable to communicate, even though
there are other nodes with which both members of a pair can communicate.
Such faults could result from misadjusted transmitters and receivers that
are marginally operative and thus able to communicate with some, but
not all, other nodes. We exclude this type of fault because it does not lie
within the scope of a link-level protocol; we do not wish to complicate the
protocol with a forwarding requirement that is properly the responsibility
of the network level.


3.5  The  Broadcast  Protocol 

Expressed  in  informal  terms,  the  proposed  broadcast  link-level  protocol 
requires  that 

• Each message be broadcast with a header in which there is a message
identifier containing the source of the message and a message sequence
number. A version number is also included in the identifier to
distinguish retransmissions. Sequence numbers can be repeated over some
suitably long interval. The message also carries an error-detecting
code.  Other  fields  of  the  header,  such  as  a  message  destination  list, 
may  be  present  but  do  not  play  any  part  in  this  protocol. 

•  Each  node  maintains  a  list  of  positive-  and  negative-acknowledgment 
message  identifiers.  Whenever  it  broadcasts  a  message,  it  appends 
this  list  of  acknowledgments  to  the  message  and  then  clears  its  list. 

• When a node receives a message not previously received in an
uncorrupted state, it adds the identifier as an acknowledgment to its list.
If the message is uncorrupted, the identifier is added as a positive
acknowledgment; if the message is corrupted but with an uncorrupted
header, the identifier is added as a negative acknowledgment.

• When a node sees a positive acknowledgment appended to a message
that it receives, it deletes from its own list any positive
acknowledgment for that message. When it sees a negative acknowledgment
for a message, it deletes from its list any acknowledgment for that message,
whether positive or negative.

•  When  a  node  sees  a  positive  acknowledgment  for  a  message  that  it 
has  not  received,  it  adds  a  negative  acknowledgment  to  its  list. 

• If a node has no messages pending, it may be necessary to construct
a null message to carry acknowledgment messages. The acceptable
delay before transmitting a null message may differ for positive and
negative acknowledgments.

• When a node receives a negative acknowledgment for one of its
messages, or has received no positive acknowledgment within some time
interval, it retransmits the message. The retransmission must be
identical to the prior transmission, and thus must carry with it all
of the acknowledgments, positive or negative, carried by the prior
transmission of that message.
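The list-handling rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the report's implementation: the names `Message`, `Node`, `broadcast`, and `receive` are hypothetical, a plain string stands in for the full identifier (source, sequence number, version), a corrupted message is assumed to arrive with an intact header, and timeout-driven retransmission and null messages are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    ident: str                       # stands in for (source, sequence number, version)
    acks: frozenset = frozenset()    # positive acknowledgments carried
    nacks: frozenset = frozenset()   # negative acknowledgments carried


@dataclass
class Node:
    received: set = field(default_factory=set)    # identifiers received uncorrupted
    sent: dict = field(default_factory=dict)      # own messages, kept for retransmission
    acks: set = field(default_factory=set)        # pending positive acknowledgments
    nacks: set = field(default_factory=set)       # pending negative acknowledgments
    retransmit: set = field(default_factory=set)  # own messages that must be resent

    def broadcast(self, ident):
        """Append the pending acknowledgment lists to a new message, then clear them."""
        msg = Message(ident, frozenset(self.acks), frozenset(self.nacks))
        self.acks.clear()
        self.nacks.clear()
        self.sent[ident] = msg       # a retransmission must be this identical object
        self.received.add(ident)
        return msg

    def receive(self, msg, corrupted=False):
        if corrupted:
            # Assumption: the header is intact, so the identifier can be trusted;
            # the appended acknowledgments are not read from a corrupted message.
            if msg.ident not in self.received:
                self.nacks.add(msg.ident)
            return
        if msg.ident not in self.received:
            self.received.add(msg.ident)
            self.nacks.discard(msg.ident)  # assumption: a pending nack is now moot
            self.acks.add(msg.ident)
        for ident in msg.acks:
            self.acks.discard(ident)       # another node has already acknowledged it
            if ident not in self.received:
                self.nacks.add(ident)      # acknowledged elsewhere, never seen here
        for ident in msg.nacks:
            self.acks.discard(ident)
            self.nacks.discard(ident)
            if ident in self.sent:
                self.retransmit.add(ident)
```

A node that finds one of its own identifiers in `retransmit` would resend the stored `sent[ident]` unchanged, which keeps the retransmission identical to the prior transmission as the last rule requires.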

As an example, consider the following message sequence, in which
uppercase letters represent messages (we do not bother to denote the source of
the message directly), lower-case letters represent acknowledgments, and
overhead bars denote negative acknowledgments.

A  Ba  Cb  Dc̄  Cbd  Ec

Here the negative acknowledgment of C that accompanies message D
triggers a retransmission. Note that the node broadcasting message E also
acknowledges message C; in doing so, it implicitly acknowledges messages
B and D and through B message A as well. This implicit acknowledgment
is the basis of the reliability property described below.

The effect of missing several messages can be considered in this example.


A  Ba  Cb  Dc̄  Cbd  Ecb̄  Bae  Fb

Here the node broadcasting message D received message C garbled and saw
nothing of message B. When C is retransmitted with a positive
acknowledgment for B, that node becomes aware that it missed B and
transmits a negative acknowledgment. Thus a short sequence of missing
messages can be recovered; however, it would be unwise to depend on this
technique for recovery from a lengthy node failure.

3.5.1  Notes 

Depending on the format of messages and the form of error-detecting codes
used, it may not be possible to determine with confidence the identifier of
a message that is received corrupted. If so, nodes that receive such
corrupted messages cannot enqueue a negative acknowledgment for fear that
the identifier might be incorrect, but instead must simply ignore the
corrupted message. If some other node has received the message uncorrupted
and broadcasts an acknowledgment, then one or more of the nodes that
received the message corrupted will generate the negative acknowledgment,
based on the positive acknowledgment for a message that they have not yet
seen. If no node receives the message uncorrupted, no positive
acknowledgment will be generated and the originating node will retransmit
the message after the timeout. Because of the nature of the acknowledgment
protocol, the timeout need not be long and thus the effect on performance
should be negligible.

It is permissible but not essential for a node to broadcast a positive
acknowledgment for a message that it had already received uncorrupted.
Nodes should not broadcast negative acknowledgments for such messages,
as this can cause additional, unnecessary retransmissions, possibly never
terminating if errors are sufficiently frequent.

Because a retransmission must be identical to each previous
transmission of the same message, a node that receives a message carrying a
negative acknowledgment of one of its own messages must not append the
positive acknowledgment of that message to the retransmission; the
positive acknowledgment must wait in the queue for some subsequent message.
Permitting further acknowledgments to be added to a message on
retransmission would preclude a node that has already received the message
from ignoring the retransmission, and would thus risk incurring the
nonterminating sequence of retransmissions.

When a node joins or rejoins an already operating network, the first
few positive acknowledgments that it receives will be for messages that
were broadcast prior to its entry into the network and that it therefore
has not received. If the node broadcasts negative acknowledgments for
those messages, forcing their retransmission, it will receive with those
messages the positive acknowledgments to even earlier messages. This results
in replaying the entire message history of the network in reverse order!
To avoid this, we require that a processor joining or rejoining the
network should broadcast negative acknowledgments only for messages with
sequence numbers greater than the sequence number of a message that it
has received correctly.

The description of the protocol given above is a rather operational
description that requires immediate performance of the operations, without
regard to other node performance constraints or the need to make
continuous use of the broadcast medium. Clearly, the performance of the
protocol is improved if each node can respond very rapidly to each message
it receives. Ideally, on seeing an acknowledgment appended to a message,
the node should be able to ensure that it will not also transmit the same
acknowledgment, even if it is next in line to transmit a message that has
already been prepared with that same acknowledgment attached. Similarly,
on receiving a message, a node might be able to include the acknowledgment
with its next message, even if the latter must be transmitted immediately.

In practice, however, it takes time to check the cyclic redundancy check
code, manipulate acknowledgment queues, and construct message packets,
while efficient use of the communication medium requires that the next
message be transmitted with as little delay as possible. The idealized
expectation that reception of a message can be reflected in the
acknowledgments that accompany the next message is unrealistic.
Nevertheless, delays in broadcasting acknowledgments and extra
acknowledgments, either positive or negative, have no logical effect on the
protocol and only a very small effect on performance. Thus it is of little
significance if processing constraints do not permit immediate
acknowledgment or immediate removal of acknowledgments from pending
messages, so that some acknowledgments are delayed a few messages while
others are broadcast twice. The formal temporal logic specifications impose
temporal constraints on acknowledgments that do not imply the unrealistic
requirements inherent in the behavioral description.


3.6  Reliability  Property 

Provided that the proportion of messages received corrupted is much less
than 1/N for an N node network and that there are no pairs of nodes
that are systematically unable to communicate, the protocol appears quite
robust. We can define for it a strong reliability property:

When a node acknowledges a message, if there are no
unacknowledged messages prior to that message and if no prior
message has an outstanding negative acknowledgment, then the
node must have received correctly every message prior to the
message it acknowledged.

The proof of this property is based on the representation of messages
and acknowledgments as a finite directed acyclic graph Gx. Nodes of the
graph represent messages, while its edges represent positive
acknowledgments. We use the term graph node to denote the nodes of the
graph, so as to distinguish them from network nodes. The construction of
the graph Gx is as follows:

• Transmission (or retransmission) of a message M adds a graph node
M to Gx.

• Transmission (or retransmission) of a message M with a positive
acknowledgment of message N adds an edge from graph node M to
graph node N.

• Transmission (or retransmission) of a negative acknowledgment of
message N deletes the graph node N and all its in and out edges.

Lemma 1. The graph Gx is acyclic.

Proof. In the construction of Gx, an edge is added from node M to
node N if message M acknowledges message N; thus message M must have
been sent after message N. Every edge therefore runs from a later message
to an earlier one, so no cycle can be formed.

Lemma 2. If there are no unacknowledged messages, there exists a
single root in the graph Gx.

Proof. If there are no unacknowledged messages, every node of Gx,
except for the one most recently inserted, has an in edge. Thus, the most
recently inserted node is the root of Gx.

Lemma  3.  If  an  acyclic  graph  G  has  a  single  root  R,  every  node  of  G 
is  reachable  from  R. 

Proof.  This  is  a  standard  result  in  graph  theory  whose  proof  follows 
by  structural  induction  on  the  set  of  subgraphs  of  G  with  the  subgraph 
relation. 

Lemma 4. If there are no unacknowledged messages and if the most
recently transmitted message A does not contain a negative acknowledgment
of Z, then the network node originating A has received Z or a previous
version thereof correctly.

Proof. Consider the graph constructed above. Let P be a path
from A to Z in Gx. The proof is by structural induction on the set of
subpaths of P that start at A with the subpath relation.

Base. P = ({A, Z}, {(A, Z)}). The lemma follows.

Step. P ≠ ({A, Z}, {(A, Z)}). Assume that the lemma holds for all
subpaths P' of P that start at A. Let Y be the immediate predecessor of
Z on the path P, and let P' be the subpath of P from A to Y. By the
inductive assumption, the network node originating A has received Y or
a previous version thereof correctly. Furthermore, Y contains a positive
acknowledgment of Z. Hence, A knows of the existence of Z. Since A does
not contain a negative acknowledgment of Z, the network node originating
A has received Z or a previous version thereof correctly.

Theorem 1. If there are no unacknowledged messages and no
outstanding negative acknowledgments, then the node that sent the most
recent message has received all messages correctly.

Proof. Consider the graph constructed above. By Lemmas 1 and 2,
Gx is acyclic and has a single root A, which corresponds to the most
recent message sent. Let M be an arbitrary message. Then, by Lemma 3,
there exists a path P from A to M. By Lemma 4, the network node
originating A has received M or a previous version thereof correctly.
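The graph construction and the single-root argument can be tried out on the first example sequence of Section 3.5, in which message D carries the negative acknowledgment of C and the retransmitted C carries acknowledgments of B and D. The sketch below is illustrative only: the function names `build_graph`, `roots`, and `reachable` and the tuple encoding of the history are assumptions made here, and an acknowledgment of a graph node that has been deleted is simply skipped.

```python
def build_graph(history):
    """Build the acknowledgment graph from (message, acks, nacks) tuples
    listed in broadcast order."""
    nodes, edges = set(), set()
    for msg, acks, nacks in history:
        nodes.add(msg)                     # (re)transmission adds the graph node
        for n in nacks:                    # a negative acknowledgment deletes the
            nodes.discard(n)               # node and all its in and out edges
            edges = {(u, v) for (u, v) in edges if n not in (u, v)}
        for a in acks:                     # a positive acknowledgment adds an edge
            if a in nodes:                 # assumption: ignore acks of deleted nodes
                edges.add((msg, a))
    return nodes, edges


def roots(nodes, edges):
    """Graph nodes with no in edge, i.e. messages not yet acknowledged."""
    return nodes - {v for (_, v) in edges}


def reachable(start, edges):
    """All graph nodes reachable from start along acknowledgment edges."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for (x, v) in edges:
            if x == u and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


# First example of Section 3.5: A, Ba, Cb, D with a negative ack of C,
# C retransmitted carrying acks of B and D, then E acknowledging C.
HISTORY = [
    ("A", [], []),
    ("B", ["A"], []),
    ("C", ["B"], []),
    ("D", [], ["C"]),
    ("C", ["B", "D"], []),
    ("E", ["C"], []),
]
```

For this history the only graph node without an in edge is E, and every other message is reachable from it, which is the situation Lemmas 2 and 3 describe.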

We can also provide predicates on the message history that determine
whether a given node has received a specific message correctly and thus,
by enumeration, whether all nodes have received a specific message
correctly. Again the proof of these properties is based on the representation
of messages and acknowledgments as a finite directed acyclic graph Gx.
The graph differs from the one above in that edges of the graph represent
positive or negative acknowledgments or retransmissions. The construction
of the graph Gx is as follows:

• Transmission (retransmission) of a message M (Mi) adds a graph
node M (Mi) to Gx.

• Transmission of a message M with a positive (negative)
acknowledgment of message N adds an edge labeled positive (negative) from
graph node M to graph node N.

• Retransmission of a message Mi adds an edge labeled retransmission
from graph node Mi to graph node M.

Lemma 5. If there exists a path of positive acknowledgments in Gx
from A to Z and no negative acknowledgment has been issued for any
message M on the path by A or by a message N that has been acknowledged
(directly or indirectly) by A, then the network node originating A has
received Z correctly.

Proof. Consider the graph constructed above. Let P be a path
from A to Z in Gx. The proof is by structural induction on the set of
subpaths of P that start at A with the subpath relation.

Base. P = ({A, Z}, {(A, Z)}). The lemma follows even without the
second hypothesis.

Step. P ≠ ({A, Z}, {(A, Z)}). Assume that the lemma holds for all
subpaths P' of P that start at A. Let Y be the immediate predecessor
of Z on the path P, and let P' be the subpath of P from A to Y. By
the inductive assumption, the network node originating A has received Y
correctly.

Suppose now that the network node originating A did not receive Z
correctly. Then, since the network node originating A saw Y's positive
acknowledgment for Z, either A contains a negative acknowledgment for
Z or there exists a negative acknowledgment for Z contained in a message
that the network node originating A has acknowledged. In either case, we
have a contradiction of the second hypothesis.

Theorem 2. If there exists a path of positive acknowledgments or
retransmissions in Gx from A to Z and no negative acknowledgment has
been issued for any message M on the path by A or by a message N that
has been acknowledged (directly or indirectly) by A, then the network node
originating A has received Z or some version thereof correctly.

Proof.  By  direct  extension  to  the  proof  of  Lemma  5. 

The various situations involved in Lemma 5 and Theorem 2 are depicted
in Figure 3.1.


Figure 3.1: Determination of the Receipt of a Message. Analysis of the graph
enables one to conclude that message Z has been received by the node
broadcasting the three messages A, B and C.

We  are  currently  working  on  a  more  formal  statement  of  the  protocol 
and  an  accompanying  more  formal  proof  of  this  reliability  property. 


3.7  Performance  Model 


In order to compare the broadcast protocol with existing link-level
protocols, a simple queuing theory analysis has been done. To ensure a fair
comparison, we require for the reliable broadcast protocol that every node
broadcast a message, possibly null, within a prescribed time interval to
ensure that the originators of broadcast messages can be certain that
every recipient has the message. We shall compare the time to obtain such
positive acknowledgment with the corresponding time for other protocols.
This positive-acknowledgment comparison imposes a heavy burden on the
negative-acknowledgment broadcast protocol, but by almost any other
measure the broadcast protocol is so much better that there is little point in
even making a comparison.

Consider first a simple point-to-point positive-acknowledgment system.
Let the

Number of nodes in the network be n
Time to transmit a message be s
Ratio of the time to transmit an acknowledgment to the time
to transmit a message be p
Proportion of messages requiring broadcast be r
Rate of demand for message transmission be ν.

Then the load on the broadcast medium is

$$\lambda = s\nu(1 - r + nr)(1 + p),$$

and the time to broadcast a message and receive the corresponding
acknowledgments is

$$\frac{s(1 + p)(1 - r + nr)}{1 - \lambda}.$$

This may be rather optimistic, since it assumes random initiation of
broadcasts and thus understates the amount of contention that arises between
message broadcasts and attempts to acknowledge prior broadcasts. Careful
implementation of such a protocol may succeed in reducing such contention.
To some extent, it also disregards the effects of disparities in the lengths of
messages and acknowledgments.

Turning to the reliable broadcast protocol, we must define the time
period d for which a node must wait before sending a null message to indicate
that it is still present in the network and has received prior broadcasts. We
also denote by q the probability that a node will not have transmitted a
regular message within d and thus will require a null message.

Then, the load on the broadcast medium is

$$\lambda = s\nu(1 - r)(1 + p) + s\nu r(1 + npq)$$

where


while the delay incurred before it is certain that the broadcast message has
been received by all destinations is

$$\frac{s(1 - r)(1 + p) + rs(1 + npq)}{1 - \lambda}.$$


These equations were solved numerically by using a simple Pascal
program to obtain the results shown in the following figures.

Figure 3.2 compares the time to receipt of a positive acknowledgment in
systems of three sizes: 10, 20 and 50 nodes. We assume that transmission of
a typical message requires use of the broadcast medium for 1 ms, while an
acknowledgment alone requires only 0.1 ms. In this figure, we assume that
a node will transmit an acknowledgment within 100 ms if it has not sent
any other message within that time. As expected, the results of the analysis
show that a 10-node point-to-point protocol becomes saturated at about 90
messages per second, a 20-node system at about 45 messages per second,
and a 50-node system at about 18 messages per second. In contrast, the
10- and 20-node broadcast protocol results are almost identical, becoming
saturated at about 1000 messages per second; at that load all
acknowledgments can be piggybacked onto other messages. In the 50-node
system, the broadcast protocol becomes saturated at about 250 messages per
second. At all sizes, the broadcast protocol provides an order-of-magnitude
increase in potential traffic load.

Figure 3.2: Comparison of Times to Positive Acknowledgment for Point-to-point
and Broadcast Protocols.

It is also appropriate to investigate the effect of a node's waiting time
before broadcasting an acknowledgment when it has no other message to
transmit. When the delay is as long as one second, Figure 3.3 shows that
the results for systems containing 10, 20 and 50 nodes are almost identical
and that all become saturated at about 1000 messages per second. But, of
course, the time to positive acknowledgment is long. Reducing the delay
to 10 ms greatly reduces the time to positive acknowledgment, but now
causes all sizes to become saturated below 1000 messages per second. In
each case, however, the broadcast protocol is able to support much more
traffic than a point-to-point protocol, with comparable times to positive
acknowledgment.
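The point-to-point saturation rates quoted for Figure 3.2 can be checked by hand from the load formula of this section. The following Python sketch does so under explicit assumptions: s = 1 ms, p = 0.1 (the acknowledgment time taken as a fraction of the message time), and r = 1 (every message is broadcast); the function names are illustrative and this is not the Pascal program mentioned above. For the broadcast protocol, q is treated as a free input, since its defining expression is not available here.

```python
def p2p_load(nu, n, s=1e-3, p=0.1, r=1.0):
    """Load on the medium for the point-to-point protocol: s*nu*(1 - r + n*r)*(1 + p)."""
    return s * nu * (1 - r + n * r) * (1 + p)


def p2p_ack_time(nu, n, s=1e-3, p=0.1, r=1.0):
    """Time to broadcast a message and collect its acknowledgments;
    infinite once the medium is saturated (load >= 1)."""
    load = p2p_load(nu, n, s, p, r)
    if load >= 1:
        return float("inf")
    return s * (1 + p) * (1 - r + n * r) / (1 - load)


def p2p_saturation_rate(n, s=1e-3, p=0.1, r=1.0):
    """Message rate nu at which the point-to-point load reaches 1."""
    return 1 / (s * (1 - r + n * r) * (1 + p))


def broadcast_load(nu, n, q, s=1e-3, p=0.1, r=1.0):
    """Load for the broadcast protocol, with q supplied by the caller (assumption)."""
    return s * nu * (1 - r) * (1 + p) + s * nu * r * (1 + n * p * q)


for n in (10, 20, 50):
    print(n, "nodes:", round(p2p_saturation_rate(n), 1), "messages per second")
```

With these parameter values the computed rates round to about 91, 45, and 18 messages per second, in line with the figures quoted for the 10-, 20-, and 50-node point-to-point systems, and the broadcast load with q = 0 reaches 1 at 1000 messages per second.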

Finally, we consider the possibility that not all of the messages require
broadcasting to all other nodes; some messages are intended for only a single
destination. Figure 3.4 shows results for 100%, 10%, and 1% of all
messages requiring broadcast. The point-to-point protocol shows substantially
better performance as the proportion of broadcast messages diminishes.
The broadcast protocol results are identical except for the 50 node, 100%
broadcast case. Clearly, the advantage of the broadcast protocol lessens
commensurately as the proportion of broadcast messages is reduced.

Figure 3.3: The Effect of Delay Time on the Protocol Performance.

Figure 3.4: The Effect of the Proportion of Messages Broadcast on the Protocol
Performance.


3.8 Broadcast Algorithms for Mutual
Exclusion and Distributed Update


We have started to consider various applications for which the reliable
broadcast protocol would be advantageous. Mutual exclusion, locking, and
synchronization algorithms exemplify an application in which broadcast
communication can provide substantial benefits. A simple mutual exclusion
protocol, based on the algorithms of Ricart and Agrawala [8], uses claim,
reject, and release messages. A node seeking the lock broadcasts a claim
message and waits. If no other node disputes this claim by broadcasting a
reject message within that period, the node may enter the critical section.
A node may broadcast a release message as it leaves the critical section,
though such messages are necessary only when other nodes are waiting.
Contention among nodes can be resolved by timestamps in the usual way [9].

The above protocol will work under ideal conditions, but is hardly
robust; any one of a number of errors could result in more than one node
in the critical section simultaneously. There is no easy way to guarantee
recovery when a node fails while in the critical section other than through
an audit and restoration of the shared resource. However, much can be
done to make the mutual exclusion protocol more robust.

The protocol can be refined by defining a caucus of nodes responsible
for administration of the lock. Only members of the caucus maintain a
record of lockholders and thus need to respond to claim messages, rejecting
them because of conflict or because fewer than a majority of those in the
caucus are currently able to communicate. While this ensures reliability in
the presence of network partitioning, or failure of caucus members, it does
not, of course, guarantee against failure of the node holding the lock.
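The decision a caucus member makes on seeing a claim message can be sketched as follows. This is a hedged illustration of the two rejection conditions named above, not a specification: the function name `caucus_response`, its parameters, and the convention that a member counts itself among the members it can communicate with are all assumptions introduced here.

```python
def caucus_response(claimant, holder, in_contact, caucus):
    """Return "reject" if a caucus member must dispute a claim, else None.

    claimant   -- node broadcasting the claim message
    holder     -- node currently recorded as holding the lock, or None
    in_contact -- nodes this member can currently communicate with
                  (assumed to include the member itself)
    caucus     -- the full set of caucus members
    """
    members_in_contact = len(set(in_contact) & set(caucus))
    # Fewer than a majority of the caucus reachable: the member may be in a
    # minority partition, so it must not let the claim stand.
    if 2 * members_in_contact <= len(caucus):
        return "reject"
    # Conflict: another node is already recorded as the lockholder.
    if holder is not None and holder != claimant:
        return "reject"
    return None
```

Requiring a strict majority means that two partitions of the caucus can never both allow a claim to stand, which is why the refinement tolerates partitioning but still cannot protect against failure of the lockholder itself.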

A similar protocol permits updates and commitment in a replicated
database. The caucus is composed of the set of nodes holding copies of
the data in question. Updates are performed by a single broadcast
message conveying the update request and whatever additional timestamps are
needed by the conflict detection algorithms, which unfortunately are not
themselves simplified by the broadcasting. After a delay during which the
update can be rejected by reason of conflict or lack of a majority, it is
automatically committed. The protocol also provides for reading data
reliably from the database, for readmitting a failed node (particularly a
caucus member) and for rejoining a partitioned network.


3.9  Conclusions 

Aside from intellectual interest, the utility of such protocols depends on
the cost and speed ratios for processing and communication, the load on
the communication medium, the nature of the traffic, and the effect of the
delays required by the broadcast protocols.


Part IV

Extending Interval Logic to
Real Time Systems


4.1 Abstract



Interval logic is a temporal logic that provides a higher-level framework
for specifying distributed systems. The concepts of intervals and
interval composition form the basic structure of many specifications. Interval
logic allows such conceptual requirements to be stated rather directly and
intuitively.

Temporal logic has suffered from its orientation towards eventuality
rather than immediacy in real time; indeed, pure temporal logic makes no
reference to time! There are many real time properties that are critical
to the specification of distributed systems. We have been able to extend
interval logic to allow real time bounds on intervals and to allow events to
be defined by real time offsets from other events. The extension is clean
and sufficient to describe real time constraints directly and easily.

The  interval  logic  is  demonstrated  by  application  to  the  lift  specification 
example. 


4.2  Introduction 


Temporal logic has been found useful for specifying distributed asynchronous
systems. Traditionally, such specifications have been expressed as
interacting state machines, but that approach inevitably suffers from
overspecification, for the state machines represent an implementation. If the
application is such that only one implementation is envisaged, an
implementation-oriented specification may be acceptable; but other
applications, for example communications protocol specifications, envisage
many distinct implementations. By specifying the minimum required externally
visible behavior, leaving all other aspects to lower levels of description,
one can obtain a more general specification that reflects the necessary
requirements of the distributed system or protocol. A specification that is
oriented towards one implementation may discourage or even preclude other
equally valid implementations. Specifications expressed in temporal logic do
not suffer as severely from implementation bias as do state machine
specifications.

A specification for a distributed system can serve to define the
externally observable function of the system, in effect the service provided by
the system. Such specifications are called service specifications. A service
specification regards the entire distributed system as a single entity, with
multiple interfaces at separate nodes of the distributed system. The
specification defines how operations at each interface, performed
asynchronously, affect results at other interfaces. Ideally, a service
specification defines only the behavior visible at the external interfaces,
without suggesting any internal structure for the system.

Many service specifications define that all operations at the external
interfaces be serializable, a characteristic that is often desirable for user
interfaces. Such specifications can often be expressed with simpler
specification languages that provide only the concepts of parallel operation
and of atomicity.

Alternatively, a specification can define the manner in which the
separate components of the distributed system interact with each other so
as to provide the required function. Such a specification is called an
implementation specification or a protocol specification. An implementation
specification defines separately the behavior of each component, so that
each distributed component can be implemented separately. The
specifications describe how the components communicate with each other using
a communication facility, which is defined by a service specification, as is
shown in Figure 4.1. The communication facility is, of course, itself a
distributed system for which there is, in addition to the service
specification, also an implementation specification dependent on an even more
primitive communication mechanism. In many distributed systems, the hierarchy
of such specifications is several levels deep.

If there are to be several independent implementations of some of the
components, in the future even if not immediately, it is important that the
implementation specification describe only how the components interact
with each other without unnecessarily constraining the internal
implementation of any component. The ideal specification is one in which

• any component that satisfies its specification will operate
satisfactorily in the system, and

• any component that operates satisfactorily in the system will satisfy
the specification.

If both a service specification and an implementation specification have
been constructed for a distributed system, it is possible to validate the
implementation specification by confirming that it satisfies the service
specification. This ability is very valuable, because the implementation
specification is often quite complex and prone to error, while the service
specification is much shorter and simpler. Unfortunately, the current state
of the art, and particularly of tools, has not yet advanced to the point at
which such a validation is feasible for typical distributed systems.

Figure 4.1: Specification of a Level in the Protocol Hierarchy.


4.3  The  Basic  Interval  Logic 


In a previous survey paper [10], we examined how several different temporal
logic  approaches  express  the  conceptual  requirements  for  a  simple  protocol. 
Our  conclusions  were  both  disappointing  and  encouraging.  On  one  hand, 
we  saw  how  the  very  abstract  temporal  requirements  provided  an  elegant 
statement  of  the  minimal  behavior  for  an  implementation  to  conform  to 
the  specification.  We  were  able  to  distill  a  set  of  requirements  express¬ 
ing  the  essence  of  the  desired  behavior,  stating  only  requirements  without 
implementation-constraining  expedients. 

While we were happy with the level of conceptualization of the
specifications, their expression in temporal logic was rather complex and difficult
to understand. The relatively low level of the linear-time temporal logic
operators  encourages  the  inclusion  of  additional  state  components  that  are 
not  properly  part  of  the  specification,  but  that  help  to  establish  the  context 
necessary  to  express  the  requirements.  Without  these  components,  context 
can  only  be  achieved  by  complex  nestings  of  temporal  until  constructs  to 
establish  a  sequence  of  prior  states.  The  survey  paper  showed  how  the 
introduction  of  state  simplifies  the  temporal  logic  formulas  at  the  expense 
of  increasing  the  amount  of  “mechanism”  in  the  specification.  The  specifi¬ 
cation  that  defined  only  the  minimum  required  externally  visible  behavior, 
without any additional internal state components, was also the least
readable. As a result of this survey, the interval logic was developed to allow
the specification of distributed systems in a manner that corresponds more
closely to the intuitive intent and understanding of the designers.

At the heart of our interval logic are formulas of the form:

[ I ] α

Informally, the meaning of this is: “The next time the interval I can be
constructed, the formula α will ‘hold’ for that interval.” This interval
formula is evaluated within the current interval context and is vacuously
satisfied if the interval I cannot be found. A formula ‘holds’ for an interval
if it is satisfied by the interval sequence, with the present state being the
beginning of the interval.

The unary □ and ◇ temporal logic operators retain their intuitive
meaning within interval logic. The formula [ I ] □ α requires that property
α must hold throughout the interval, while [ I ] ◇ α expresses the property
that sometime during the interval I, α must hold. For simple state predicate
P, the interval formula [ I ] P expresses the requirement that P be true in
the first state of the interval.

Interval  formulas  compose  with  the  other  temporal  operators  to  derive 
higher-level  properties  of  intervals.  The  formula 

[ I ] [ J ] α

states that the first J interval contained in the next I interval, if found, will
have property α. The property that all J intervals within interval I have
property α would be expressed as [ I ] □ [ J ] α. More globally, the formula
□ [ I ] α requires all further I intervals to have property α.

Each interval formula [ I ] α constrains α to hold only if the interval
I can be found. Thus only when the context can be established need the
interval property hold. To require that the interval occur, one could write
¬[ I ] False. The interval language defines the formula *I to mean exactly
this.
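
The informal readings above can be tried out on finite traces. The following is only a sketch under simplifying assumptions: a trace is a Python sequence of states, an interval is a pair of inclusive state indices (or None when it cannot be found), and the function names are ours rather than part of the logic.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumption: an interval is a pair of inclusive indices, or None if not found.
Interval = Optional[Tuple[int, int]]


def holds_always(trace: Sequence, interval: Interval, prop: Callable) -> bool:
    """[ I ] □ prop: prop holds in every state of I (vacuous if I is absent)."""
    if interval is None:
        return True
    lo, hi = interval
    return all(prop(s) for s in trace[lo:hi + 1])


def holds_sometime(trace: Sequence, interval: Interval, prop: Callable) -> bool:
    """[ I ] ◇ prop: prop holds in some state of I (vacuous if I is absent)."""
    if interval is None:
        return True
    lo, hi = interval
    return any(prop(s) for s in trace[lo:hi + 1])


def holds_initially(trace: Sequence, interval: Interval, prop: Callable) -> bool:
    """[ I ] P: the state predicate P is true in the first state of I."""
    if interval is None:
        return True
    return prop(trace[interval[0]])


def must_occur(interval: Interval) -> bool:
    """*I, i.e. ¬[ I ] False: true exactly when the interval can be found."""
    return interval is not None


trace = [0, 3, 5, 2]
print(holds_always(trace, (1, 2), lambda s: s > 2))
print(holds_always(trace, None, lambda s: False))
```

The second call shows the vacuous case: a missing interval satisfies even the property False, which is why *I is needed to demand that the interval exist.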

Thus  far,  we  have  described  how  to  compose  properties  of  intervals 
without  discussing  how  intervals  are  formed.  At  the  heart  of  a  very  general 
mechanism  for  defining  and  combining  intervals  is  the  notion  of  an  event. 
An event, defined by an interval formula β, occurs when β changes from
False to True, i.e., when it becomes true. In the simplest case, β is a
predicate on the state, such as z > 5 or atDq. Note that, if the
predicate  is  true  in  the  initial  state,  the  event  occurs  when  it  changes  from 
False  to  True,  and  thus  only  after  the  predicate  has  become  False. 
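
As a rough illustration of this definition, the sketch below locates the events of a state predicate in a finite trace. The list-of-dictionaries representation and the name `event_positions` are assumptions made for the example.

```python
from typing import Callable, Dict, List

State = Dict[str, int]


def event_positions(trace: List[State],
                    pred: Callable[[State], bool]) -> List[int]:
    """Indices k at which pred is False in state k-1 and True in state k."""
    return [k for k in range(1, len(trace))
            if not pred(trace[k - 1]) and pred(trace[k])]


# z > 5 already holds in the initial state, so the first event occurs only
# after the predicate has become False and then True again.
trace = [{"z": 7}, {"z": 8}, {"z": 3}, {"z": 6}, {"z": 9}, {"z": 1}, {"z": 10}]
events = event_positions(trace, lambda s: s["z"] > 5)
print(events)
```

Each reported index is the second state of the two-state change, that is, the state in which the predicate has just become true.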

Intervals are defined by a simple or composed interval term. The
primitive interval, from which all intervals are derived, is the event interval. An
event, defined by β, denotes the interval of change of length 2 containing
the ¬β and β states comprising the change. Pictorially, this is represented
as

[Figure: the event β, a two-state interval consisting of the ¬β state and the β state that follows it.]


Two functions, before and end, operate on intervals to extract unit
intervals. For interval term I, before I denotes the unit interval containing the
first state of interval I. Similarly, end I denotes the unit interval at the end.
Application of the end function is undefined for infinite intervals. Again,
pictorially, the intervals selected are

[Figure: an interval I, with before I marking the unit interval at its first state and end I the unit interval at its last state.]

For a P predicate event, the following formulas are valid.

[ end P ] P

[ before P ] ¬P

[ P ] ¬P

4.3.1  The Interval Operators ⇒ and ⇐

Two  generic  operators  exist  to  derive  intervals  from  interval  arguments. 
We  take  the  liberty  of  overloading  these  operators  to  allow  zero,  one  or  two 
interval-value  arguments.  Intuitively,  the  direction  of  the  operator  indicates 
in which direction and in which order the interval endpoints are located.
The  endpoint  at  the  tail  of  the  arrow  is  first  located,  followed  by  a  search 
in  the  direction  of  the  arrow  for  the  second  endpoint.  A  missing  parameter 
causes  the  related  endpoint  to  be  that  of  the  outer  context. 
The interval term I ⇒ denotes the interval commencing at the end of
the next interval I and extending for the remainder of the outer context.
The right arrow operator, in effect, locates the first I interval, relative to
the outer context, and forms the interval from the end of that I interval
onward. With only a second argument present, ⇒ J denotes the interval
commencing with the first state of the outer context and extending to the
end of the first J interval. Thus,

[Figure: the intervals I ⇒ and ⇒ J, shown within the outer context.]


The term I ⇒ J, with two interval arguments, represents the
composition of the two definitions. This constructs the interval starting at the
end of interval I and extending to the end of the next interval J located
in the interval I ⇒. Given this definition, the interval formula [ I ⇒ J ] α
is equivalent to [ I ⇒ ] [ ⇒ J ] α. Recall that the formula [ I ⇒ J ] α
is vacuously true if the I ⇒ J interval cannot be found. Pictorially, the
interval selected is

[Figure: the interval I ⇒ J, from the end of I to the end of the next J.]


The  right  arrow  operator  with  no  interval  arguments  selects  the  entire 
outer  context. 
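
The three forms of the right arrow operator can be sketched for the simple case in which both arguments are events given by state predicates. This is an illustration under our own assumptions (finite traces, contexts as inclusive index pairs, hypothetical names `next_event` and `right_arrow`), not a general implementation of interval terms.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, bool]
Pred = Callable[[State], bool]
Interval = Optional[Tuple[int, int]]


def next_event(trace: List[State], pred: Pred, start: int, stop: int) -> Optional[int]:
    """End state of the first False-to-True change of pred within start..stop."""
    for k in range(start + 1, stop + 1):
        if not pred(trace[k - 1]) and pred(trace[k]):
            return k
    return None


def right_arrow(trace: List[State], context: Tuple[int, int],
                i_pred: Optional[Pred] = None,
                j_pred: Optional[Pred] = None) -> Interval:
    """I ⇒ J within context; a missing argument takes the context endpoint."""
    lo, hi = context
    start = lo
    if i_pred is not None:
        found = next_event(trace, i_pred, lo, hi)
        if found is None:
            return None            # I cannot be found
        start = found              # begin at the end of I
    end = hi
    if j_pred is not None:
        found = next_event(trace, j_pred, start, hi)
        if found is None:
            return None            # J cannot be found after I
        end = found                # finish at the end of J
    return (start, end)


A = lambda s: s["A"]
B = lambda s: s["B"]
trace = [
    {"A": False, "B": False},
    {"A": True, "B": False},    # event A ends here
    {"A": True, "B": False},
    {"A": False, "B": True},    # event B ends here
    {"A": False, "B": False},
]
context = (0, len(trace) - 1)
print(right_arrow(trace, context, A, B))
print(right_arrow(trace, context, A))
print(right_arrow(trace, context, None, B))
print(right_arrow(trace, context))
```

Searching for J only from the end of I onward is what makes I ⇒ J the composition of I ⇒ and ⇒ J.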

The left arrow operator is defined analogously. For interval term
I ⇐ J, the first J interval in context is located. From the end of this J
interval, the most recent I interval is located. The derived interval I ⇐ J
begins with end I and ends with end J. Thus,

[Figure: the interval I ⇐ J, from the end of the most recent I to the end of J.]

Similarly, the interval term I ⇐ selects the interval beginning with the end
of the last I interval and extending for the remainder of the context. For
a context in which an interval I occurs an infinite number of times, the
formula [ I ⇐ ] α is vacuously true. The interval terms ⇐ and ⇐ J are
strictly equivalent to ⇒ and ⇒ J, respectively.

The following examples illustrate the use of the interval operators.

[ x = y ⇒ y = 16 ] □ x > z    (4.1)

[Figure: the interval from the event x = y to the event y = 16.]

For  the  interval  beginning  with  the  next  event  of  the  variable  x  becoming 
equal  to  y  and  ending  with  y  changing  to  the  value  16,  the  value  of  x  is 
asserted  to  remain  greater  than  z.  The  first  state  of  the  interval  is  thus 
the state in which x is equal to y and the last state is that in which y is
next equal to 16. Note that the events x = y and y = 16 denote the next
changes from x ≠ y and y ≠ 16.

To  modify  the  above  requirement  to  allow  x  >  z  to  become  False  as  y 
becomes  16,  one  could  write 

[ x = y ⇒ before(y = 16) ] □ x > z    (4.2)
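
The difference between (4.1) and (4.2) can be seen on a small trace in which x > z becomes False in the very state in which y becomes 16. The sketch below assumes finite traces of variable assignments; the helper names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, int]


def next_event(trace: List[State], pred: Callable[[State], bool],
               start: int) -> Optional[int]:
    """End state of the first False-to-True change of pred after start."""
    for k in range(start + 1, len(trace)):
        if not pred(trace[k - 1]) and pred(trace[k]):
            return k
    return None


def x_exceeds_z(trace: List[State], stop_before_change: bool) -> bool:
    """(4.1) when stop_before_change is False, (4.2) when it is True."""
    first = next_event(trace, lambda s: s["x"] == s["y"], 0)
    if first is None:
        return True                      # interval not found: vacuous
    last = next_event(trace, lambda s: s["y"] == 16, first)
    if last is None:
        return True                      # interval not found: vacuous
    if stop_before_change:
        last -= 1                        # before(y = 16): the state preceding the change
    return all(s["x"] > s["z"] for s in trace[first:last + 1])


trace = [
    {"x": 1, "y": 2, "z": 0},
    {"x": 2, "y": 2, "z": 0},    # x becomes equal to y
    {"x": 3, "y": 5, "z": 1},
    {"x": 1, "y": 16, "z": 2},   # y becomes 16 and x > z fails
]
print(x_exceeds_z(trace, stop_before_change=False))
print(x_exceeds_z(trace, stop_before_change=True))
```

On this trace the first check fails and the second succeeds, since only (4.1) includes the state in which y has become 16.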

Nesting  interval  terms  provides  a  method  of  expressing  more  compre¬ 
hensive  context  requirements.  Consider  the  formula 

[ (A ⇒ B) ⇒ C ] ◇ D    (4.3)

[Figure: events A, B, C in order; the interval runs from B to C, within which ◇ D must hold.]

The formula requires that, if an A event is found, the subsequent B to C
interval, if found, must sometime satisfy property D. The outer ⇒ operator
selects the interval commencing at the end of its first argument, in this case,
at the end of the selected A ⇒ B interval. The interval then extends until
the next C event, establishing the necessary context.

In the previous example, the formula was vacuously true if any of the
events A, B, or C could not be found in the established context. In order
to easily express a requirement that a particular event or interval must be
found if the necessary context is established, we introduce an interval term
modifier *. For interval term I, *I adds an additional requirement that I
must be found in the designated context. The formula

[ (A ⇒ *B) ⇒ C ] ◇ D    (4.4)

strengthens formula (4.3) by adding the requirement that, if an A event
occurs, a subsequent B event must occur. This is equivalent to formula (4.3)
conjoined with [ A ⇒ ] *B.

The * modifier can be applied to an arbitrary interval term. The formula
[ *(A ⇒ B) ⇒ C ] ◇ D, for example, would be equivalent to (4.3) conjoined
with *(A ⇒ B), or equivalently, *A ∧ [ A ⇒ ] *B. The * modifier
adds only linguistic expressive power and can be eliminated by a simple
reduction (given in the Appendix).

As  an  example  of  specifying  context  for  the  end  of  the  interval,  consider 
the  formula 

[ A ⇒ (B ⇒ C) ] ◇ D    (4.5)

[Figure: events A, B, C in order; the interval runs from A to C.]

Here, the interval begins with the next occurrence of A and terminates with
the first C that follows the next B.

By modifying formula (4.3) to begin the interval at the beginning of
A ⇒ B, i.e.,

[ before(A ⇒ B) ⇒ C ] ◇ D    (4.6)

[Figure: events A, B, C; the interval runs from the beginning of A ⇒ B to C.]

we obtain a requirement similar to that of (4.5), but allowing events B and
C to be arbitrarily ordered.

Introducing the use of backward context, to find the interval A ⇒ B in
the context of C, we have

[ (A ⇒ B) ⇐ C ] ◇ D    (4.7)

[Figure: events A, B, C; the interval runs from B to C, with A ⇒ B located backward from C.]

Here the occurrence of the first C event places an endpoint on the context,
within which the most recent A ⇒ B interval is found. Note the order of
search: looking forward, the next C is found, then backward for the most
recent A, then forward for the next B. Thus, the formula is vacuously true
if no B is found between C and the most recent A.
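
The search order just described (forward to C, backward to the most recent A, forward again to B) can be sketched as follows for event arguments. The trace encoding, in which each state is the string of predicate names true in it, and the function names are assumptions for the example.

```python
from typing import Callable, List, Optional, Tuple

Pred = Callable[[str], bool]


def next_event(trace: List[str], pred: Pred, start: int, stop: int) -> Optional[int]:
    """End state of the first False-to-True change of pred within start..stop."""
    for k in range(start + 1, stop + 1):
        if not pred(trace[k - 1]) and pred(trace[k]):
            return k
    return None


def last_event(trace: List[str], pred: Pred, start: int, stop: int) -> Optional[int]:
    """End state of the most recent False-to-True change within start..stop."""
    for k in range(stop, start, -1):
        if not pred(trace[k - 1]) and pred(trace[k]):
            return k
    return None


def interval_4_7(trace: List[str], A: Pred, B: Pred, C: Pred) -> Optional[Tuple[int, int]]:
    """The interval (A ⇒ B) ⇐ C in the context of the whole trace."""
    c = next_event(trace, C, 0, len(trace) - 1)    # forward: the next C
    if c is None:
        return None
    a = last_event(trace, A, 0, c)                 # backward: the most recent A
    if a is None:
        return None
    b = next_event(trace, B, a, c)                 # forward: the next B
    if b is None:
        return None
    return (b, c)                                  # from end of A ⇒ B to end of C


A = lambda s: "A" in s
B = lambda s: "B" in s
C = lambda s: "C" in s

found = interval_4_7(["", "A", "", "A", "B", "D", "C"], A, B, C)
vacuous = interval_4_7(["", "A", "B", "", "A", "", "C"], A, B, C)
print(found, vacuous)
```

In the second trace a B does follow the first A, but none follows the most recent A before C, so the interval is not found and the formula holds vacuously.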

As a last example, consider

[ before(A ⇒ B) ⇐ C ] ◇ D    (4.8)

[Figure: events A, B, C; the interval runs from the beginning of the most recent A ⇒ B to C.]

The interval extends back from the first C event to the beginning of the
most recent A ⇒ B interval.


4.3.2  Parameterized  Operations 

Within  the  language  of  our  interval  logic  we  include  the  concept  of  an  ab¬ 
stract  operation.  For  an  abstract  operation  O,  state  predicates  atO,  inO, 
and  afterO  are  defined.  These  predicates  carry  the  intuitive  meanings  of 
being “at the beginning”, “within”, and “immediately after” the
operation. Formally, we use the following temporal axiomatization of these state
predicates.

1.  [ atO ⇒ before afterO ] □ inO

2.  [ afterO ⇒ before atO ] □ ¬inO

3.  [ ¬atO ⇒ afterO ] □ ¬atO

4.  [ ¬afterO ⇒ atO ] □ ¬afterO


Axioms 1 and 2 together define inO to be true exactly from atO to the state
immediately  preceding  afterO.  Axiom  3  allows  atO  to  be  true  only  at  the 
beginning  of  the  operation,  and  axiom  4  requires  that  afterO  be  true  only 
immediately  following  an  operation.  Note  that,  in  axiom  1  for  example, 
the  predicate  atO  used  as  an  event  term  defines  the  interval  commencing 
with  the  entry  to  the  operation. 
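
To make the four axioms concrete, the sketch below checks them against a finite trace of at, in and after values. It is an illustration only: it assumes that each axiom is to hold in every suffix of the trace, treats an interval that cannot be completed within the trace as not found, and uses names of our own choosing.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, bool]
Pred = Callable[[State], bool]


def next_event(trace: List[State], pred: Pred, start: int) -> Optional[int]:
    """End state of the first False-to-True change of pred after start."""
    for k in range(start + 1, len(trace)):
        if not pred(trace[k - 1]) and pred(trace[k]):
            return k
    return None


def axiom_holds(trace: List[State], first: Pred, second: Pred,
                body: Pred, stop_before: bool) -> bool:
    """[ first ⇒ second ] □ body, or with 'before second', in every suffix."""
    for context_start in range(len(trace)):
        lo = next_event(trace, first, context_start)
        if lo is None:
            continue
        hi = next_event(trace, second, lo)
        if hi is None:
            continue                 # interval not found: vacuously satisfied
        if stop_before:
            hi -= 1
        if not all(body(s) for s in trace[lo:hi + 1]):
            return False
    return True


def satisfies_operation_axioms(trace: List[State]) -> bool:
    at = lambda s: s["at"]
    after = lambda s: s["after"]
    inside = lambda s: s["in"]
    not_at = lambda s: not s["at"]
    not_after = lambda s: not s["after"]
    not_inside = lambda s: not s["in"]
    return (axiom_holds(trace, at, after, inside, True)              # axiom 1
            and axiom_holds(trace, after, at, not_inside, True)      # axiom 2
            and axiom_holds(trace, not_at, after, not_at, False)     # axiom 3
            and axiom_holds(trace, not_after, at, not_after, False)) # axiom 4


def st(at: bool, inside: bool, after: bool) -> State:
    return {"at": at, "in": inside, "after": after}


good = [st(False, False, False), st(True, True, False), st(False, True, False),
        st(False, False, True), st(False, False, False), st(True, True, False),
        st(False, True, False), st(False, False, True)]
# at becomes true a second time in the middle of the operation.
bad = [st(False, False, False), st(True, True, False), st(False, True, False),
       st(True, True, False), st(False, False, True)]
print(satisfies_operation_axioms(good), satisfies_operation_axioms(bad))
```

The second trace is rejected by axiom 3, because at is true again between leaving the entry state and reaching after.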

The axioms do not imply any specific granularity, duration or mapping
of the operation symbol to an implementation. Any interpretation of these
state predicate symbols satisfying the above axioms is allowed. In addition,
no assumption of operation termination is made. To require an operation
to always terminate, one could state as an axiom

[ atO ⇒ *afterO ] True

Abstract operations may take entry and result parameters. For an
operation taking n entry parameters of types T₁, ..., Tₙ, and m result
parameters of types Tₙ₊₁, ..., Tₙ₊ₘ, the at and after state predicates are overloaded
to include parameter values. atO(v₁, ..., vₙ) is true in any state in which
atO is true and the values of the parameters are v₁, ..., vₙ. The predicate
after is similarly overloaded.

As  an  example  of  an  interval  requirement  involving  parameterized  oper¬ 
ations,  consider  an  operation  O  with  a  single  entry  parameter.  To  require 
that  this  parameter  increase  monotonically  over  the  call  history,  one  could 
state 

∀a, b □ [ atO(a) ⇒ atO(b) ] b > a

Since a and b are free variables, for all a and b such that we can find an
interval commencing with an atO(a) and ending with an atO(b), b must be
greater than a. Recall that the formula is vacuously true for any choice of
a and b such that the interval cannot be found.

It  is  also  useful  to  be  able  to  designate  the  next  occurrence  of  the  oper¬ 
ation  call,  and  to  bind  the  parameter  values  of  that  call.  The  event  term 
atO  :  (a)  designates  the  next  event  atO  and  binds  the  free  variable  a  to 
the  value  of  the  parameter  for  that  call.  Thus  the  previous  requirement 
constraining  all  pairs  of  calls,  can  be  restated  in  terms  of  successive  calls  as 

□ [ atO(a) ⇒ atO:(b) ] b > a

The requirement is now that for every a, the call atO(a) is followed by
a call of O whose parameter is greater than a. This parameter binding
convention has a general reduction, which we omit here. For this specific
formula, the reduction gives

□ [ atO(a) ⇒ ] ( [ end atO ] atO(b) ) ⊃ [ ⇒ atO ] b > a
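
The two formulations of the monotonicity requirement, over all pairs of calls and over successive calls, can be compared on a finite call history. In this sketch the history is abstracted to the list of entry parameter values in call order, which is our simplification; the function names are hypothetical.

```python
from typing import List


def increasing_over_all_pairs(params: List[int]) -> bool:
    """Every later call has a greater parameter than every earlier call."""
    return all(params[j] > params[i]
               for i in range(len(params))
               for j in range(i + 1, len(params)))


def increasing_over_successive_calls(params: List[int]) -> bool:
    """Each call is followed by a call with a greater parameter."""
    return all(b > a for a, b in zip(params, params[1:]))


histories = [[1, 4, 9], [1, 5, 3], [2, 2], []]
for h in histories:
    print(h, increasing_over_all_pairs(h), increasing_over_successive_calls(h))
```

On finite histories the two checks agree, because the ordering > is transitive; the successive form simply inspects fewer pairs.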

4.4  Some  Example  Specifications 

Consider  a  queue  with  two  operations,  Enq  which  takes  a  single  parameter 
value,  which  it  enqueues,  and  Dq  which  removes  the  value  at  the  front  of  the 
queue and returns that value as its result. We assume in this specification
that  the  queue  is  unbounded,  and  require  that  values  enqueued  must  be 
distinct.  No  assumptions  are  made  about  the  atomicity  of,  or  temporal 
relationships  between,  the  Enq  and  Dq  operations.  These  operations  can 
overlap  in  an  arbitrary  manner.  We  do  assume  that  at  most  one  instance 
of  the  Enq  and  Dq  operations  will  be  active  at  any  given  time. 

The  specification  expresses  the  fundamental  first-in  first-out  behavior 
that  characterizes  a  queue.  It  requires  that,  for  all  a  and  b,  if  we  dequeue 
b, then any other value a will be dequeued in the interim if and only if it
was enqueued prior to b. Further axioms are needed to express liveness
requirements on the two operations.


Queue.  [ ⇒ afterDq(b) ] ( *afterDq(a) ≡ *(atEnq(a) ⇐ atEnq(b)) )
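
The Queue requirement can be exercised on a finite history. The sketch below makes several simplifying assumptions that are not in the specification: operations are collapsed to instantaneous ("enq", v) and ("deq", v) records, only the context beginning at the start of the history is examined, and the name `fifo_spec_holds` is ours.

```python
from typing import List, Tuple

Record = Tuple[str, int]


def fifo_spec_holds(history: List[Record]) -> bool:
    """For each dequeued b and each other value a: a is dequeued by the time
    b is dequeued if and only if a was enqueued before b."""
    values = {v for _, v in history}
    for b in values:
        if ("deq", b) not in history:
            continue                          # interval ⇒ afterDq(b) not found
        end = history.index(("deq", b))
        prefix = history[:end + 1]
        for a in values - {b}:
            dequeued = ("deq", a) in prefix
            enqueued_first = (("enq", a) in prefix and ("enq", b) in prefix
                              and prefix.index(("enq", a)) < prefix.index(("enq", b)))
            if dequeued != enqueued_first:
                return False
    return True


fifo = [("enq", 1), ("enq", 2), ("deq", 1), ("enq", 3), ("deq", 2), ("deq", 3)]
lifo = [("enq", 1), ("enq", 2), ("deq", 2), ("deq", 1)]
print(fifo_spec_holds(fifo), fifo_spec_holds(lifo))
```

The stack-like history is rejected because 1 was enqueued before 2 but has not been dequeued by the time 2 is.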

As  a  second  example,  consider  a  specification  to  ensure  exclusive  access 
to  a  shared  critical  section  by  some  set  of  processes.  Each  process  is  to  make 
an  independent  decision  based  on  a  shared  global  data  structure.  In  stating 
the specification, we assume a state predicate cs(i) which, for process i,
indicates that i is in the critical section. For a shared global data structure,
we assume a state predicate x(i) which, for process i, indicates i's intention
to  enter  the  critical  section.  We  wish  to  state  minimal  requirements  on  the 
use  of  state  predicate  x  by  a  process  to  ensure  mutual  exclusion.  Pictorially 
we  represent  the  required  behavior  as  follows: 

[Figure: the interval from the setting of x(i), marked *x(i), to the entry cs(i); over it, □ x(i) holds and, for all j ≠ i, ◇ ¬x(j).]

As shown, an entry of the critical section by process i must be preceded
by an earlier setting of x(i) to true. Throughout this interval x(i) must
remain true, and, for every other process j, there must be some moment
within the interval at which x(j) is false. This specification imposes no
requirement on the order or frequency of inspecting the x(j)s; it suffices
that, at some time during the interval, each x(j) is false. Herein lies the
basic reason for exclusion. x(i) remains true through the interval, and no
other x(j) can be true for that interval. Thus no other process j can find
x(i) false between the time that i signals his intention and the time that i
leaves the critical section (or abandons his claim). The specification does
not, however, ensure the absence of deadlock.

In  interval  logic,  we  express  these  requirements  as  follows. 

Init.  ∀m ¬x(m)

A1.  i ≠ j ⊃ [ x(i) ⇐ cs(i) ] ◇ ¬x(j)

A2.  cs(i) ⊃ x(i)


Given an initial condition in which all processes have relinquished their
claims, axiom A1 expresses our previous pictorial requirement that, if
process i enters the critical section, then for the interval back to the most
recent setting of x(i), each x(j) must be found to be false. Axiom A2
requires that x(i) remains true while i is in the critical section. We have not
needed to state explicitly that there must be a setting of x(i) prior to the
entry; this is deducible from the specification. Similarly we can deduce that
x(i) remains true through that interval.

From this specification, we can demonstrate (omitted here) the mutual
exclusion property that henceforth no pair of processes can both be in the
critical section at the same time, i.e.,

∀m ¬x(m) ∧ i ≠ j ⊃ □ ¬( cs(i) ∧ cs(j) )
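
Although the proof is omitted, the axioms can at least be tested against small traces. The following sketch checks A1 and A2 on a finite trace and, separately, the mutual exclusion property. The state layout (tuples of booleans indexed by process) and all names are assumptions for illustration, and passing such a check on sample traces is of course no substitute for the deduction.

```python
from typing import Dict, List, Tuple

State = Dict[str, Tuple[bool, ...]]


def rising_edges(trace: List[State], key: str, i: int, upto: int) -> List[int]:
    """States k <= upto in which trace[k][key][i] has just become true."""
    return [k for k in range(1, upto + 1)
            if trace[k][key][i] and not trace[k - 1][key][i]]


def satisfies_a1_a2(trace: List[State], n: int) -> bool:
    for i in range(n):
        # A2: cs(i) implies x(i) in every state.
        if any(s["cs"][i] and not s["x"][i] for s in trace):
            return False
        # A1: between the most recent setting of x(i) and each entry cs(i),
        # every other x(j) must be false at some moment.
        for entry in rising_edges(trace, "cs", i, len(trace) - 1):
            settings = rising_edges(trace, "x", i, entry)
            if not settings:
                continue              # interval x(i) ⇐ cs(i) not found: vacuous
            start = settings[-1]
            for j in range(n):
                if j != i and all(s["x"][j] for s in trace[start:entry + 1]):
                    return False
    return True


def mutually_exclusive(trace: List[State]) -> bool:
    return all(sum(s["cs"]) <= 1 for s in trace)


def st(x: Tuple[bool, ...], cs: Tuple[bool, ...]) -> State:
    return {"x": x, "cs": cs}


F, T = False, True
polite = [st((F, F), (F, F)), st((T, F), (F, F)), st((T, F), (T, F)),
          st((F, F), (F, F)), st((F, T), (F, F)), st((F, T), (F, T))]
greedy = [st((F, F), (F, F)), st((T, T), (F, F)), st((T, T), (T, F)),
          st((T, T), (T, T))]
print(satisfies_a1_a2(polite, 2), mutually_exclusive(polite))
print(satisfies_a1_a2(greedy, 2), mutually_exclusive(greedy))
```

The second trace, in which both processes raise their claims together and then both enter, violates A1 as well as mutual exclusion.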

4.5  Real  Time  Extensions 

Temporal logic has suffered from its orientation towards eventuality rather
than immediacy in real time; indeed, pure temporal logic makes no reference
to time! A temporal logic specification defines only invariants, eventuality,
and order constraints on the sequence of states resulting from the execution
of the distributed system without reference to when the states actually
occur. But the specification of distributed systems typically depends critically
on the specification of real time properties.

Surprisingly,  in  view  of  the  orientation  of  temporal  logic  towards  even¬ 
tuality,  there  are  useful  eventuality  properties,  superficially  independent 
of  real  time,  that  cannot  be  written  without  reference  to  real  time.  For 
example,  the  service  specification  for  a  lift,  without  consideration  of  the 
possibility  of  lift  failure,  can  be  expressed  as  a  requirement  that  if  a  re¬ 
quest  is  made  for  floor  a  then,  eventually,  the  lift  will  be  at  floor  a  with  the 
door  open. 

□ ( request(a) ⊃ ◇ ( atfloor(a) ∧ dooropen(a) ) )

Unfortunately,  any  practical  lift  inevitably  has  occasions  when  it  is  out 
of  service,  expressed  as 

□ ◇ ¬inservice.

If we are to avoid expedients such as regarding an out of service state as a
terminal state, or requiring that the lift remember the request for floor a
through the out of service state (an unreasonable requirement), we would
like to modify the service specification to state that the lift will eventually
be at floor a unless it goes out of service first. There is no way to express
that requirement; the best that can be achieved is

□ ( request(a) ⊃ ◇ ( (atfloor(a) ∧ dooropen(a)) ∨ ¬inservice ) )

Careful  examination  shows  that  this  specification  is  completely  satisfied  by 
the  eventual  out  of  service  condition  and  it  thus  contributes  nothing  to  the 
requirement  that  a  request  be  serviced  by  moving  to  the  requested  floor. 
In effect, the lift can satisfy the specification by doing nothing but waiting
until it breaks.
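
The weakness of this specification is easy to reproduce. The sketch below evaluates the formula on a finite trace, reading ◇ as "somewhere in the remainder of the trace", which is an assumption forced by finiteness; the state fields and function names are hypothetical.

```python
from typing import Dict, Iterable, List


def state(floor: int, door_open: bool, in_service: bool,
          requests: Iterable[int]) -> Dict:
    return {"floor": floor, "door_open": door_open,
            "in_service": in_service, "requests": set(requests)}


def weak_spec_holds(trace: List[Dict], a: int) -> bool:
    """Every state with request(a) is eventually followed by service at
    floor a or by an out of service state."""
    for k, s in enumerate(trace):
        if a not in s["requests"]:
            continue
        if not any((t["floor"] == a and t["door_open"]) or not t["in_service"]
                   for t in trace[k:]):
            return False
    return True


# A lift that never moves and eventually goes out of service.
lazy = [state(0, False, True, []), state(0, False, True, [1]),
        state(0, False, True, [1]), state(0, False, False, [])]
print(weak_spec_holds(lazy, 1))
print(weak_spec_holds(lazy[:-1], 1))
```

The idle lift is accepted as soon as the trace contains its breakdown, and rejected only if the breakdown is cut off.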

To  overcome  this  problem,  we  must  place  a  real  time  bound  on  the 
period  of  time  throughout  which  the  lift  must  be  operational  to  guarantee 
that  the  service  will  be  provided.  The  service  specification  then  becomes 

□ [ request(a) ⇒ request(a) + max_service_time ]
        □ inservice ⊃ ◇ ( atfloor(a) ∧ dooropen(a) )

This states that for an interval commencing with the request and of length
max_service_time, if the lift is never out of service during the interval then
the service will be provided within that interval.
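
A timed version of the check shows how the bound removes the loophole. This sketch assumes a finite trace of (time, state) pairs with increasing times, treats a request as the event of the request becoming true, and regards an interval that extends past the end of the trace as not found; the names are ours.

```python
from typing import Dict, Iterable, List, Tuple

Timed = Tuple[float, Dict]


def state(floor: int, door_open: bool, in_service: bool,
          requests: Iterable[int]) -> Dict:
    return {"floor": floor, "door_open": door_open,
            "in_service": in_service, "requests": set(requests)}


def bounded_spec_holds(trace: List[Timed], a: int, max_service_time: float) -> bool:
    for k in range(1, len(trace)):
        was = a in trace[k - 1][1]["requests"]
        now = a in trace[k][1]["requests"]
        if was or not now:
            continue                       # no request(a) event at state k
        deadline = trace[k][0] + max_service_time
        if trace[-1][0] < deadline:
            continue                       # trace ends before the interval does
        window = [s for t, s in trace[k:] if t <= deadline]
        always_in_service = all(s["in_service"] for s in window)
        served = any(s["floor"] == a and s["door_open"] for s in window)
        if always_in_service and not served:
            return False
    return True


idle = [(0, state(0, False, True, [])), (1, state(0, False, True, [1])),
        (5, state(0, False, True, [1])), (12, state(0, False, True, [1]))]
breaks = [(0, state(0, False, True, [])), (1, state(0, False, True, [1])),
          (5, state(0, False, False, [1])), (12, state(0, False, True, [1]))]
serves = [(0, state(0, False, True, [])), (1, state(0, False, True, [1])),
          (5, state(1, True, True, [])), (12, state(1, False, True, []))]
for name, tr in (("idle", idle), ("breaks", breaks), ("serves", serves)):
    print(name, bounded_spec_holds(tr, 1, 10))
```

With a bound of 10 time units, the lift that stays in service but idle is now rejected, while a lift that fails during the interval is still excused.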

Thus  we  need  to  extend  interval  logic  to  include  real  time  constraints, 
but  we  do  not  want,  in  so  doing,  to  destroy  what  is  valuable  about  the 
logic. Temporal logics are valuable because they allow the expression of
necessary  properties  while  precluding  other  forms  of  expression  that  would 
be  inappropriate.  For  example,  if  time  is  represented  explicitly  as  a  numeric 
variable  in  our  specifications,  it  is  possible  to  express  any  useful  temporal 
property,  including  those  involving  real  time  constraints.  But,  the  explicit 
representation of time makes possible expressions that have no meaning,
such  as  those  in  which  a  property  depends  on  whether  the  time  is  even  or 
odd!  Thus  the  extension  must  not  expose  the  numeric  nature  of  time. 

Further,  temporal  logics  mask  quantifications  over  time.  An  explicit 
representation  of  time  could  require  that  those  temporal  quantifications 
be  explicit,  complicating  both  the  formulae  and  also  deduction  involving 
the formulae. If it is possible to hide the quantifiers, and to process them
automatically  during  deduction,  as  it  is  with  temporal  logics,  we  should 
try  to  do  so. 


The  interval  logic  can  be  extended  to  include  real  time  by: 

•  imposing  real  time  bounds  on  the  length  of  intervals,  and 

•  allowing  events  to  be  defined  by  real  time  offsets  from  other  events. 

Defining events by real time offsets is achieved by two new operators
+, −, syntactically defined by

+, −: event × duration constant → event.

Thus if E is an event then so are E + 1 second and E − 1 day.

Bounds on the length of intervals are provided by two relational
operators, syntactically defined by

>, <: duration constant → boolean.

These relational operators are monadic because they relate the length of
the enclosing context to the duration constant. Used within an interval,
they therefore relate the length of that interval to the constant. Thus, if
I is an interval, [ I ] < 1 second is a boolean predicate on the length of that
interval. Similarly, we might write [ I ] > 10 seconds ∧ ◇ x = 4.

The relational operators can be derived from the event constructors
by defining an event offset from the start of the interval and determining
whether  that  event  lies  within  the  interval.  However,  the  availability  of  the 
relational  operators  adds  directness  and  clarity  to  the  specifications. 

These extensions to interval logic are clean and appear sufficient to
describe almost all real time constraints directly and easily. They do not
permit  the  construction  of  undesired  expressions  in  which  time  is  manipu¬ 
lated  inappropriately. 

The  decidability  of  interval  logic  is  unaffected  by  these  extensions.  It 
is  not  appropriate  to  digress  here  into  a  lengthy  analysis  of  decidability, 
but  rather  we  give  only  a  brief  outline  of  the  necessary  extensions  to  the 
decision  process.  A  decision  procedure  for  interval  logic  can  be  constructed 
as  a  standard  semantic  tableau,  building  a  graph  of  possible  states.  The 
transitions between states are determined by the order of events, and thus
the  predicates  on  the  states  comprise  the  conjunction  of  the  normal  state 
predicates  with  a  set  of  relations  on  the  order  of  events. 

To  extend  this  semantic  tableau  decision  process  to  the  real  time  ver¬ 
sion  of  interval  logic,  the  real  time  relational  operators  are  first  reduced 
to  terms  involving  event  constructors,  as  described  above.  The  semantic 
tableau  procedure  is  applied,  as  before,  but  order  relations  on  events  are 
regarded  as  linear  inequalities  in  a  real  number  domain,  and  real  time  event 
constructors  are  replaced  by  arithmetic  operations  in  that  domain.  Linear 
arithmetic and linear inequalities in a real number domain are decidable by
a  Presburger  procedure,  thus  maintaining  the  decidability  of  the  logic. 
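
To give a feel for the kind of arithmetic involved, the sketch below tests the consistency of a set of order relations between event times, with real time offsets, of the restricted form t[j] − t[i] ≤ c. It uses Bellman–Ford negative-cycle detection, a standard technique for such difference constraints that we substitute here for brevity; it covers only this fragment, is not the Presburger-style procedure referred to above, and its names are hypothetical.

```python
from typing import List, Tuple

# (i, j, c) stands for the constraint t[j] - t[i] <= c on event times.
Constraint = Tuple[int, int, float]


def consistent(num_events: int, constraints: List[Constraint]) -> bool:
    """True if some assignment of real times satisfies every constraint."""
    # Distances from a virtual source joined to every event by a zero edge.
    dist = [0.0] * num_events
    for _ in range(num_events + 1):
        changed = False
        for i, j, c in constraints:
            if dist[i] + c < dist[j]:
                dist[j] = dist[i] + c
                changed = True
        if not changed:
            return True          # relaxation converged: times exist
    return False                 # still improving: a negative cycle exists


REQUEST, DEADLINE, SERVED = 0, 1, 2
base = [
    (REQUEST, DEADLINE, 10.0),   # deadline <= request + 10
    (DEADLINE, REQUEST, -10.0),  # deadline >= request + 10
    (DEADLINE, SERVED, 0.0),     # served no later than the deadline
    (SERVED, REQUEST, 0.0),      # served no earlier than the request
]
too_slow = base[:3] + [(SERVED, REQUEST, -12.0)]   # served >= request + 12
print(consistent(3, base), consistent(3, too_slow))
```

The second set is inconsistent because service at least 12 units after the request cannot also fall within a deadline placed 10 units after it.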


4.6  The  Lift  Example 

The  objective  of  the  Interval  Logic  specification  is  to  express  precisely  and 
formally  the  behavior  required  from  the  lift.  It  is  also  an  objective  to  ex¬ 
press  as  few  constraints  on  that  behavior  as  possible  while  still  ensuring 
correct  behavior.  It  is,  perhaps,  easier  to  provide  a  specification  that  de¬ 
scribes  the  lift  in  minute  and  mechanistic  detail,  but  to  do  so  precludes, 
or  at  least  makes  much  less  obvious,  many  valid  implementations  that  are 
structured  rather  differently.  Our  specification,  indeed,  permits  quite  a 
wide  range  of  behaviors;  lifts  that  demonstrate  some  of  the  less  obvious, 
but  still  permissible,  strategies  can  be  found  in  operation  on  occasion. 

Floors 

The  floors  are  0  to  n,  and  the  lift  will  not  go  outside  this  range.  There 
can  be  no  down  button  on  floor  0,  and  no  up  button  on  floor  n. 

1.  ¬atfloor(−1) ∧ ¬atfloor(n + 1)

2.  ¬light(0, down) ∧ ¬light(n, up)

3.  ¬request(0, down) ∧ ¬request(n, up)

The lift is at only one floor at a time and moves only to adjacent floors.

4.  b ≠ a ∧ atfloor(a) ⊃ ¬atfloor(b)

5.  [ atfloor(a) ⇒ before(atfloor(a + 1) ∨ atfloor(a − 1)) ] □ atfloor(a)

Derived  Predicates 

To simplify the specifications, we introduce a derived event newrequest,
since requests are significant only if there is not already an outstanding
request, if the lift does not already have its doors open at the requested
floor, and if the lift is in service.

6.  newrequest(a, dir) ≡ request(a, dir)
        ∧ ¬light(a, dir) ∧ closed(a) ∧ inservice

We  also  introduce  an  auxiliary  event  decision  to  represent  the  moment 
at  which  the  lift  decides  what  to  do  next.  The  event  decision(a)  occurs 
sometime  after  the  doors  open  and  before  the  lift  leaves  at  that  floor.  If 
the  lift  does  not  stop  at  floor  a,  the  event  occurs  some  time  between  being 
at  floor  a  and  not  being  at  floor  a.  Note  that,  at  the  time  of  the  event 
decision(a),  atfloor(a)  must  still  be  true. 

7.  [ atfloor(a) ⇒ before ¬atfloor(a) ] ¬*open(a) ⊃ *decision(a)

8.  [ (atfloor(a) ⇒ open(a)) ⇒ before ¬atfloor(a) ] *decision(a)

The  predicate  goingup  is  introduced  to  represent  the  decision  made  by 
the  lift  about  which  direction  to  move.  The  predicate  is  true  if  the  next 
floor  that  will  be  visited  is  above  the  current  floor,  and  false  if  it  is  below. 
It must, of course, retain that value until the next decision point. The
curious  option  of  remaining  at  the  same  floor  and  thus  making  a  second 
decision  at  that  floor  is  necessary  in  the  case  that  the  lift  arrives  at  a  floor 
in  response  to  a  request  indicating  continued  travel  in  the  same  direction, 
but  the  request  then  made  inside  the  lift  is  for  travel  in  the  other  direction. 
The  real  time  constraint  is  imposed  to  allow  the  passengers  time  to  enter 
the  lift  and  press  a  button. 

9.  [ ((atfloor(a) ∧ goingup = v) ⇒ decision(a))
        ⇒ before decision:(b) ] > min_open_time
        ∧ b > a ⊃ □ goingup
        ∧ b < a ⊃ □ ¬goingup
        ∧ b = a ⊃ □ goingup = v

[Figure: the interval from decision(a), located after atfloor(a) ∧ goingup = v, to before decision:(b); its length exceeds min_open_time.]


Lights 

The lights are used not only to represent the lights visible to the
passengers, but also to provide the memory of pending requests. Others might
prefer  to  introduce  an  additional  predicate  to  represent  the  pending  re¬ 
quests  explicitly. 

While  out  of  service  the  lights  must  not  be  lit,  and  following  a  return 
to  service  the  lights  must  not  be  lit  until  a  new  request  has  been  made. 

10.  [ ¬inservice ⇒ (inservice ⇒ before newrequest(a, dir)) ] □ ¬light(a, dir)

[Figure: events ¬inservice, inservice, newrequest(a, dir) in order; □ ¬light(a, dir) holds from ¬inservice until just before newrequest(a, dir).]

Three axioms define when the lights must not be lit between the
satisfaction  of  a  request  and  the  making  of  the  next  request.  The  case  for 
the  lift  light  is  simple,  but  the  other  cases  must  consider  the  direction  of 
motion  of  the  lift  and  also  ensure  that  the  prohibition  applies  from  the  first 
time  that  the  doors  open  at  that  floor. 
11. [open(a) ⇒ before newrequest(a, lift)] □ ¬light(a, lift)

[Graphical representation of axiom 11: events open(a), newrequest(a, lift); □ ¬light(a, lift) over the interval.]

12. [before((atfloor(a) ⇒ open(a)) ⇐ open(a) ⇐ atfloor(a + 1))
         ⇒ before newrequest(a, up)] □ ¬light(a, up)

[Graphical representation of axiom 12: events atfloor(a), open(a), atfloor(a + 1), open(a), newrequest(a, up); □ ¬light(a, up) over the interval.]

13. [before((atfloor(a) ⇒ open(a)) ⇐ open(a) ⇐ atfloor(a − 1))
         ⇒ before newrequest(a, down)] □ ¬light(a, down)

[Graphical representation of axiom 13: events atfloor(a), open(a), atfloor(a − 1), open(a), newrequest(a, down); □ ¬light(a, down) over the interval.]

The next axiom defines when the lights are required to be illuminated.
The lights can be turned off as early as the previous decision point, i.e.
shortly before reaching the requested floor. They can remain lit for longer,
but other axioms require that they be out at least by the time that the
doors are open at the requested floor. The lights need only remain on so
long as the lift is in service.
14. [newrequest(a, dir) ⇒
         before(decision:(b) ⇒ decision(a) ∧ ((dir = up ∧ goingup) ∨
                                              (dir = down ∧ ¬goingup) ∨
                                              dir = lift))]
      □ inservice ⊃ □ light(a, dir)
      ∧ [⇒ ¬inservice] □ light(a, dir)

[Graphical representation of axiom 14: events newrequest(a, dir), decision:(b), decision(a); □ light(a, dir) over the interval.]

Movement 

This  axiom  is  a  lift  scheduling  constraint  that  requires  continued  motion 
in  one  direction  so  long  as  there  are  further  requests  outstanding  in  that 
direction.  When  the  lift  decides  to  change  its  direction  of  motion,  i.e.  when 
goingup  changes  from  false  to  true  or  from  true  to  false,  there  must  be  no 
further  request  outstanding  in  the  original  direction  of  motion. 

15. b < a ⊃ [before goingup ⇒] atfloor(a) ⊃ ¬light(b, dir)

16. b > a ⊃ [before ¬goingup ⇒] atfloor(a) ⊃ ¬light(b, dir)
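
As an illustration of axioms 15 and 16, the following minimal Python sketch checks a recorded trace for illegal reversals. The trace encoding (a list of (floor, goingup, lit floors) tuples) and the function name are assumptions made for this example and are not part of the specification; the lights are inspected in the state just before goingup changes.

```python
from typing import List, Set, Tuple

# One observation of the lift: (current floor, goingup, floors whose lights are lit).
# This encoding is hypothetical; the specification itself is stated in interval logic.
State = Tuple[int, bool, Set[int]]

def direction_changes_ok(trace: List[State]) -> bool:
    """True if every reversal occurs with no request pending in the old direction."""
    for before, after in zip(trace, trace[1:]):
        floor, was_up, lit = before
        now_up = after[1]
        if was_up == now_up:
            continue  # goingup did not change at this step
        if now_up and any(b < floor for b in lit):
            return False  # turned upward while a lower floor was still lit (axiom 15)
        if not now_up and any(b > floor for b in lit):
            return False  # turned downward while a higher floor was still lit (axiom 16)
    return True

# The lift serves floor 5 before reversing: allowed.
good = [(3, True, {5}), (5, True, set()), (5, False, {1})]
# The lift reverses at floor 3 with floor 5 still lit: forbidden.
bad = [(3, True, {5}), (3, False, {5})]
print(direction_changes_ok(good), direction_changes_ok(bad))
```

A checker of this kind only tests one finite trace; it does not replace reasoning in the logic itself.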

When appropriate, the lift will stop and open its doors. Fast lifts need
time to decelerate and stop, time that is not provided by this version of the
specifications. The necessary modifications do not affect these two axioms
but rather impose a speed dependent advance on the decision point defined
in axiom 7.
17. b > a ⊃ [(decision(a) ∧ goingup ∧ (light(b, up) ∨ light(b, lift)))
                  ⇒ ¬atfloor(b)] *open(b) ∨ *¬inservice

[Graphical representation of axiom 17: events decision(a), ¬atfloor(b); *open(b) ∨ *¬inservice over the interval.]

18. b < a ⊃ [(decision(a) ∧ ¬goingup ∧ (light(b, down) ∨ light(b, lift)))
                  ⇒ ¬atfloor(b)] *open(b) ∨ *¬inservice

These  requirements  allow  the  wide  range  of  behavior  that  we  encounter 
in  lifts,  as  for  instance  in  allowing  the  lift  to  always  return  to  the  ground 
floor,  in  allowing  the  lift  a  home  floor  when  inactive,  or  even  in  allowing 
the  cattle  car  to  stop  at  every  floor  regardless. 

The local liveness axioms require that the lift should not stay at one floor
indefinitely if there are requests outstanding from other floors. The first
of the two axioms constrains the doors to close within a time constraint
if they are not obstructed. The second requires timely movement to an
adjacent floor if the lift is in service.

19. b ≠ a ⊃ [(open(a) ∧ light(b, dir)) ⇒
                  (open(a) ∧ light(b, dir)) + max_open_time]
              □ (inservice ∧ ¬obstructed(a)) ⊃ *closing(a)

[Graphical representation of axiom 19: interval from open(a) ∧ light(b, dir) to open(a) ∧ light(b, dir) + max_open_time; *closing(a) within the interval.]

20. b ≠ a ⊃ [(closed(a) ∧ light(b, dir)) ⇒
                  (closed(a) ∧ light(b, dir)) + movement_time]
              □ inservice ⊃ (*atfloor(a + 1) ∨ *atfloor(a − 1))

[Graphical representation of axiom 20: interval from closed(a) ∧ light(b, dir) to closed(a) ∧ light(b, dir) + movement_time; *atfloor(a + 1) ∨ *atfloor(a − 1) within the interval.]


Service  Specification 

We  must  next  provide  our  lift  with  a  service  specification.  Basically, 
the  service  specification  states  that  if  a  request  is  made  for  floor  a,  then 
eventually  the  lift  will  be  at  floor  a  with  the  doors  open.  As  discussed 
above,  we  must  temper  this  idealistic  requirement  with  the  possibility  that 
the  lift  may  go  out  of  service.  We  must  also  allow  for  the  possibility  that 
the  doors  may  be  obstructed  to  prevent  them  from  closing.  We  can  now 
state  an  informal  service  requirement: 

If  a  request  is  made  for  floor  a  by  pressing  a  button  inside  the  lift  or 
at  that  floor,  and  if,  throughout  a  sufficiently  long  interval  commencing 
with  the  request,  the  lift  is  never  out  of  service  and  the  doors  are  never 
obstructed,  the  lift  will  eventually  be  at  floor  a  with  its  doors  open. 

21. [newrequest(a, dir) ⇒ newrequest(a, dir) + max_service_time]
      □ (inservice ∧ ¬obstructed) ⊃ *open(a)

[Graphical representation of axiom 21: interval from newrequest(a, dir) to newrequest(a, dir) + max_service_time; □ (inservice ∧ ¬obstructed) ⊃ *open(a) over the interval.]

It is possible to elaborate this requirement to allow occasional obstruction
of the doors while still guaranteeing service, but at the cost of greatly
complicating the specification. The complexity arises not from any inability
of the specification language but from the inherent complexity of determining
to what extent it is possible to obstruct the doors while still requiring
the lift to provide timely service.
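
The informal service requirement can be exercised on sampled data. The sketch below is only an approximation of axiom 21: it assumes a discretely sampled trace, and the `Sample` record and `service_met` function are hypothetical names introduced here for illustration.

```python
from typing import List, NamedTuple, Optional

class Sample(NamedTuple):
    """One sampled observation of the lift (hypothetical encoding)."""
    time: float
    inservice: bool
    obstructed: bool
    open_at: Optional[int]  # floor whose doors are open, or None

def service_met(trace: List[Sample], request_time: float,
                floor: int, max_service_time: float) -> bool:
    """Check the service requirement for one request against one trace."""
    window = [s for s in trace
              if request_time <= s.time <= request_time + max_service_time]
    # If the lift was out of service or obstructed at any sample in the
    # window, the premise fails and the requirement imposes nothing.
    if not all(s.inservice and not s.obstructed for s in window):
        return True
    # Otherwise the doors must be open at the requested floor within the window.
    return any(s.open_at == floor for s in window)

trace = [Sample(0, True, False, None),
         Sample(2, True, False, None),
         Sample(4, True, False, 2)]
print(service_met(trace, 0, 2, 5))  # doors open at floor 2 within 5 time units
print(service_met(trace, 0, 2, 3))  # not served within 3 time units
```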

Door  opening  and  closing 

We  now  encounter  a  sequence  of  relatively  simple  axioms  that  closely 
control  the  opening  and  closing  of  the  doors.  Their  interest  lies  largely 
in  the  extent  to  which  real  time  constraints  are  necessary  to  specify  this 
aspect  of  the  lift. 

Opening,  open,  closing,  and  closed  are  complete  and  mutually  exclusive. 

22. opening(a) ∨ open(a) ∨ closing(a) ∨ closed(a)
      ∧ (opening(a) ∨ open(a)) = ¬(closing(a) ∨ closed(a))
      ∧ (opening(a) ∨ closing(a)) = ¬(open(a) ∨ closed(a))

23. [open(a) ⇒ before closing(a)] □ open(a)
      ∧ [closed(a) ⇒ before opening(a)] □ closed(a)

The  lift  must  be  at  a  floor  to  open  its  doors  and  the  doors  of  the  lift 
and  that  floor  open  and  close  together. 

24. opening(lift) = ∃a: 0 < a < n ∧ opening(a)

25. 0 < a < n ⊃ [opening(a) ⇒ closed(a)] □ atfloor(a)
      ∧ opening(lift) = opening(a)
      ∧ open(lift) = open(a)
      ∧ closing(lift) = closing(a)
      ∧ closed(lift) = closed(a)

The next five axioms place real time constraints on the sequence of
opening and closing actions of the doors, allowing for the possibility that
the doors may be obstructed. The last axiom states that the doors are only
obstructed while closing.

26. [opening(a) ⇒ open(a)] □ inservice ⊃ < opening_time

27. [closing(a) ⇒ closed(a)] □ (inservice ∧ ¬obstructed(a))
      ⊃ < closing_time

28. [obstructed(a) ⇒ opening(a)] □ inservice ⊃ < reaction_time

29. [(obstructed(a) ⇒ open(a)) ⇒ closing(a)] □ inservice ⊃ < dwell_time

30. [open(a) ⇒ closing(a)] > min_open_time

31. [obstructed(a) ⇒] closing(a)
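
Two of the door axioms lend themselves to a quick mechanical check. The sketch below transcribes axiom 22 as a Boolean function and confirms that it holds exactly when one of the four door states is true, then measures the open intervals of a sampled trace against axiom 30. The trace format and all function names are assumptions made for this example.

```python
from itertools import product

def axiom_22(opening, open_, closing, closed):
    """Literal transcription of axiom 22 for one floor at one instant."""
    return ((opening or open_ or closing or closed)
            and ((opening or open_) == (not (closing or closed)))
            and ((opening or closing) == (not (open_ or closed))))

# Axiom 22 is equivalent to "exactly one of the four predicates holds".
assert all(axiom_22(*f) == (sum(f) == 1)
           for f in product([False, True], repeat=4))

def open_durations(trace):
    """trace: time-ordered list of (time, door_state) samples for one floor."""
    durations, opened_at, previous = [], None, None
    for time, state in trace:
        if state == "open" and previous != "open":
            opened_at = time                    # doors have just become open
        elif state == "closing" and opened_at is not None:
            durations.append(time - opened_at)  # doors have just started to close
            opened_at = None
        previous = state
    return durations

def satisfies_axiom_30(trace, min_open_time):
    """True if every open-to-closing interval is longer than min_open_time."""
    return all(d > min_open_time for d in open_durations(trace))

door_trace = [(0, "closed"), (1, "opening"), (2, "open"),
              (6, "closing"), (7, "closed")]
print(open_durations(door_trace))
```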


4.7  Analysis  and  Conclusions 

We have presented an outline of our interval logic with an extension to
permit the specification of real time constraints, and have applied it to
the lift specification example. We are reasonably satisfied with its success,
although we feel that further experimentation is necessary. Current work
is proceeding on providing the interval logic with the ability to describe
multiprocess systems and to compose the specifications of single processes
into a multiprocess specification. We are also working on integrating the
interval logic into the specification language for a full verification system,
and on verification techniques for concurrent programs. Future projects
may investigate a human interface based on the graphical representation
for interval logic rather than on the linear syntax.

We are reasonably satisfied with the style of expression of the interval
logic. It appears to correspond quite closely to the intuitive forms of
reasoning and explanation used by human designers while considering concurrent
systems. In particular, the graphical representation for interval logic
appears to be very close to typical human design sketches. The behavioral
style of specification and the basing of interval formation on events derived
from state changes, motivated by our observation that establishing context
almost always required seeing a change in state, have been justified
in our experience of the use of the logic for examples. But, despite the
relatively behavioral style of specification, the specifications can be quite
abstract, with relatively little auxiliary state information introduced to
establish context. This allows specifications expressed in interval logic to
remain more general, and to impose less implementation bias, than more
state oriented methods.

The real time extensions to interval logic are important for making the
logic useful for the specification of real systems. Despite the power of the
extension, we believe that the integrity of the logic has been maintained.
That the logic with the real time extensions is still decidable is helpful in
retaining the opportunity to provide mechanical support. Unaided human
reasoning about concurrent systems is very fallible.

The specification of, and reasoning about, complex concurrent systems
is difficult, and interval logic does not eliminate that difficulty. The
difficulty is inherent in the multiplicity of possible cases that must be
considered, and in determining the relationships that are significant to the
operation of the system. Our objective with interval logic can only be to
allow the designer to express his intentions and understanding in a manner
that is close to his natural intuition.
Part V

Consistency of Replicated
Information in Multichannel
Fault Tolerant Systems

5.1 Abstract


The need for reliable computation has induced many designs for fault
tolerant computer systems based on the replication of the processors and
appropriate error detection and masking algorithms. Typical of such
systems are SIFT and FTMP, which use majority voting for error masking,
and Stratus, which uses a dual-dual structure for error masking. It is clear
that these approaches, coupled with the steadily improving reliability of
components, now allow the construction of very reliable systems.

All fault tolerant systems depend on some form of error masking
algorithm, coupled with error detection to allow the repair of faults. Some
such systems depend on backward error correction, in which a result is
computed, the acceptability of that result is checked, and in the event of
error the computation of the result is repeated. Typical of such systems
are classical Checkpoint-Restart systems and Recovery Blocks2. Backward
error correcting algorithms necessarily incur a significant overhead for
repeating the computation when an error is detected, and also involve an
acceptance test on the results, a test that is usually system and application
specific. We do not consider backward error correcting systems in this
paper but rather we examine Forward Error Correcting systems, in which
the results are computed in a redundant form that allows error masking
without repeating any computation.

Two  forward  error  correcting  algorithms  are  currently  used  for  masking 
processor  errors  in  reliable  systems,  majority  voting  and  dual-dual.  The 
majority  voting  approach  can  mask  errors  caused  by  one  faulty  channel  out 
of  three,  while  a  dual-dual  approach  masks  one  faulty  channel  out  of  four. 
Both  approaches  have  the  advantage  that  they  are  completely  application 
independent. However, majority voting and dual-dual both depend for their
operation  on  exact  match  comparison  between  results  of  computations. 
Thus,  for  successful  masking  of  errors,  it  is  essential  that  the  fault  free 
channels  should  generate  identical  results.  Both  algorithms  guarantee,  with 
only  a  single  faulty  channel  and  with  fault  free  channels  producing  identical 
results,  that  fault  free  channels  remain  error  free  and  continue  to  generate 
identical  results. 
Two questions arise from this. The first concerns whether there are any
single point faults that could cause fault free channels to generate different
results, thus invalidating the presumptions of both majority voting and
dual-dual. We describe below a class of such faults and give algorithms for
precluding them. The second question relates to the possible increase in the
risk of common mode faults resulting from the need for all channels to perform
exactly the same computation on identical data at approximately the
same time. We show below that error masking algorithms can be devised
that allow each channel to perform a different computation on different
data at different times.

5.2  Loss  of  Consistency 

Figure 1 shows a majority voted three channel system, with one faulty
and  two  working  channels.  The  successive  levels  of  the  diagram  might 
represent  distinct  units  within  the  channel,  but  equally  they  can  represent 
successive  iterations  of  a  computation  performed  by  the  same  units.  It 
is  clear  that,  provided  that  the  two  working  channels  generate  identical 
results  initially,  each  voting  operation  will  receive  as  inputs  two  identical 
values  and  one  erroneous  value.  The  voters  in  the  two  working  channels  will 
therefore  both  produce  the  same  value  for  the  majority.  Thus  the  working 
channels  continue  to  generate  identical  results,  and  consistency  between 
working  channels  is  maintained.  However,  if  at  any  time  the  three  channels 
generate  different  results,  the  voters  can  find  no  majority  and  the  system 
fails. 
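
The exact match voting described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name and the use of `None` to signal "no majority" are assumptions of the example, not details taken from SIFT or FTMP.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

def majority_vote(values: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value held by a strict majority of channels, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None

# Two working channels agree, so the single erroneous value is masked.
print(majority_vote([7, 7, 9]))
# Three different results: no majority can be found and the voter gives up.
print(majority_vote([1, 2, 3]))
```

The second call corresponds to the failure case in which the three channels generate different results.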

Figure 5.1: A Three-Channel Majority Voted System

Figure 5.2: Distribution of Information from a Single Faulty Source to a
Three-Channel System

Consider Figure 2, which shows a system of three working channels
and an input to that system from a single faulty source. The nature of
the fault is that the source distributes different values to each of the three
channels (the values A, B, and C). Even on a broadcast bus, such faults
can result from marginal timing faults or from a marginal transmitter at
the source and receivers with slightly different, but within specification,
characteristics. More complex communication mechanisms, particularly
where software is involved, permit many more such faults. The figure shows
that, if the faulty source distributes different values to each channel, the
three channels generate different results, the voters can find no majority,
and the system fails.

Figure 3 shows a three channel system with two working channels and one
faulty channel. Here information present in just one of the channels is to be
distributed to all three channels and be used in a replicated calculation. The
faulty source distributes different values to the two working channels, and
compounds the problem by repeating the same erroneous values (suitably
transformed if necessary) in the next, voted, stage of the system. Note
that not only do the two working channels continue to receive inconsistent
values, even after voting, but also each of the two working channels can be
misled into believing that it is the other working channel that is faulty.

The  existence  of  this  problem  was  discovered  during  the  design  of  SIFT, 
a  reliable  aircraft  control  system,  and  is  discussed  in  Pease  et  al.,  JACM 
April  1980,  where  it  is  shown  that  no  solution  is  possible  in  a  purely  three 
channel  system.  An  algorithm,  called  the  interactive  consistency  algorithm, 
is  given  for  a  four  channel  system  containing  a  single  faulty  channel,  and 
extended  to  the  masking  of  N  faults  in  a  3N+1  channel  system. 

The  basic  interactive  consistency  algorithm  is  given  in  Figure  4.  One  of 
the  four  channels  is  the  single  point  source  of  the  information,  and  the  three 
other  channels  are  used  to  replicate  that  information.  Once  the  information 
is  replicated,  any  or  all  of  the  channels  can  vote  the  replicated  information 
with confidence that all voters in working channels will produce the same
majority  value,  or  alternatively  all  working  voters  will  find  no  majority  and 
will  return  a  default  value.  For  this  algorithm  to  be  effective  against  all 
faults,  the  channel  that  is  the  source  of  the  information  must  be  distinct 
from  the  three  channels  that  perform  the  replication. 

Consider  the  possibility  that  the  source  channel  is  faulty.  It  may  then 
distribute  different  values  to  the  other  channels.  The  three  replicating 
channels must all be working, and thus every working voter must get the
same  set  of  inputs.  If  at  least  two  of  the  replicating  channels  have  the  same 
value,  every  working  voter  will  find  that  value  as  its  majority,  while  if  all 
three  replicating  channels  have  different  values,  every  working  voter  will 
return  the  default  value.  (If  the  source  is  faulty,  the  interactive  consistency 
algorithm  cannot  of  course  guarantee  a  correct  value  from  that  source,  but 
only  a  value  that  is  consistent  across  all  working  channels.) 

Figure 5.3: Distribution of Information from a Single Channel to
Three Channels

Figure 5.4: The Interactive Consistency Algorithm

Consider the possibility that one of the three replicating channels is
faulty. Now the source is necessarily working and will distribute the same
correct value to each of the two working replicators, which will replicate it.
Thus each working voter obtains at least two correct inputs and is able to
produce the correct value as its result.
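
The two cases just considered can be simulated directly. The following Python sketch models one source and three replicating channels; the channel names, the `behaviour` table of relay functions, and the use of `None` as the default value are all assumptions made for illustration and are not taken from the SIFT implementation.

```python
from collections import Counter

REPLICATORS = ("b", "c", "d")  # hypothetical names for the replicating channels

def vote(values, default=None):
    """Majority of three relayed values, or the default if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= 2 else default

def honest(received, voter):
    """A working replicator relays what it received, to every voter alike."""
    return received

def interactive_consistency(source_sends, behaviour, voters):
    """source_sends[r] is what the source delivered to replicator r;
    behaviour[r](received, voter) is what r relays to a given voter."""
    results = {}
    for voter in voters:
        relayed = [behaviour[r](source_sends[r], voter) for r in REPLICATORS]
        results[voter] = vote(relayed)
    return results

all_honest = {r: honest for r in REPLICATORS}

# Case 1: faulty source, three working replicators. Every working voter
# sees the same three relayed values and so reaches the same result.
case1 = interactive_consistency({"b": 1, "c": 2, "d": 2}, all_honest, REPLICATORS)

# Case 2: working source, replicator d faulty and "two-faced".
two_faced = lambda received, voter: {"a": 0, "b": 9, "c": 7}[voter]
case2 = interactive_consistency({"b": 5, "c": 5, "d": 5},
                                {"b": honest, "c": honest, "d": two_faced},
                                ("a", "b", "c"))
print(case1, case2)
```

In case 1 the agreed value need not be correct, only consistent; in case 2 the two honest relays outvote the faulty one at every working voter.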

In  SIFT,  four  circumstances  were  found  in  which  a  value  from  a  single 
source had to be distributed to three replicated channels, namely:

•  input  from  a  sensor 

•  error  reports  from  a  voter 

•  interfaces  between  unreplicated  and  replicated  tasks 

•  synchronization  of  processor  clocks. 

The first three of these require the use of the interactive consistency
algorithm to protect the system against malicious faults. The fourth is of
special interest in that exact agreement is not necessary for clock
synchronization, and thus slightly simpler algorithms guaranteeing approximate
agreement suffice.

5.3  Maintenance  of  Approximate  Consistency 

In  SIFT,  as  in  many  other  fault  tolerant  systems,  each  processor  has  its 
own  clock  and  operation  of  the  system  depends  on  these  clocks  remaining 
synchronized  (to  within  50ms  in  SIFT).  Many  prior  systems  used  three 
channels,  three  clocks,  and  a  clock  synchronization  algorithm  based  on  each 
clock  synchronizing  itself  periodically  to  the  median  clock  of  the  three.  It 
is  instructive  to  consider  why  this  “obviously  sound”  approach  is  invalid. 

Figure 5.5: A Failure Mode of the Median Clock Synchronization Algorithm

Figure 5 shows a system with two working clocks (A and B) and a
faulty clock (C). We may assume that clock A runs slightly faster than
clock B. Clock C presents to clock A an erroneous clock value indicating
that clock C is running faster even than clock A, causing clock A to assume
that it is the median clock. Thus clock A makes no correction to its value.
Similarly, clock C presents to clock B a value indicating that it is behind
even clock B, causing clock B to assume that it is the median clock and
make no correction to its clock value. By this strategy, the faulty clock C
can induce clocks A and B to operate without correcting their clock values
as they gradually drift apart until the system fails. Single point component
faults that could cause this "malicious" behavior have been found even in
purely analog clock systems.
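
The failure mode can be reproduced numerically. The sketch below is a toy model, not the SIFT clock code: the drift rate, the size of the lie told by clock C, and the one-resynchronization-per-tick schedule are assumptions chosen only to make the effect visible.

```python
def median3(x, y, z):
    """Median of three clock readings."""
    return sorted((x, y, z))[1]

def simulate_skew(rounds, faulty_c, drift=0.002, lie=1.0):
    """Return the skew A - B after the given number of resynchronizations."""
    a = b = c = 0.0
    for _ in range(rounds):
        a += 1.0 + drift   # clock A runs slightly fast
        b += 1.0 - drift   # clock B runs slightly slow
        c += 1.0           # true reading of clock C
        # A faulty C tells A that it is ahead of A, and tells B that it is
        # behind B; a working C reports the same reading to both.
        c_to_a = a + lie if faulty_c else c
        c_to_b = b - lie if faulty_c else c
        # Each working clock adopts the median of the three readings it sees.
        a, b = median3(a, b, c_to_a), median3(a, b, c_to_b)
    return a - b

print(simulate_skew(1000, faulty_c=False))  # clocks are pulled together
print(simulate_skew(1000, faulty_c=True))   # skew grows without correction
```

With a working clock C both A and B adopt the same median each round; with the two-faced clock each of A and B sees itself as the median and the skew grows linearly with time.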

It is tempting to attempt minor corrections to the three channel clock
synchronization algorithms, aimed at preventing this behavior. As yet we
have no rigorous mathematical proof that no three channel algorithm can
exist, but we believe that the approximate agreement needed for clock
synchronization requires the same number of channels as the exact agreement
discussed above.

In  SIFT,  a  four  channel  clock  synchronization  algorithm  is  used  in  which 
each  clock  is  periodically  resynchronized  to  the  mean  of  the  four  clocks. 
To  protect  against  wildly  erroneous  clock  values,  the  algorithm  imposes  a 
bound  within  which  a  clock  value  must  lie  to  be  included  in  the  averaging 
calculation.  For  n  processors  of  which  at  most  m  are  faulty,  with  R  as  the 
resynchronization  interval  and  S  as  the  time  taken  for  resynchronization, 
and  if  e  is  the  maximum  clock  reading  error  and  p  the  maximum  rate  of 
clock drift, it can be shown that the maximum skew between working clocks
will  not  exceed 
A similar problem has been examined by L. Webster in closed loop
control systems. He found that use of a median voting algorithm in a three
channel system favors the median channel, effectively disconnecting the two
other channels from the closed loop. Without cross coupling between the
integrators of the three channels, this results in uncontrolled accumulation
of error terms in the integrators of two of the channels, rendering them
useless for error masking. With cross coupling, the integrators are vulnerable
to precisely the same problem as the clocks above.

The  possibility  of  failure  to  maintain  approximate  consistency  appears 
to  exist  in  any  three  channel  system  containing  embedded  integrators. 


5.4 Asynchronous Multichannel Systems

Existing  fault  tolerant  multichannel  systems  using  forward  error  correction, 
whether  majority  voted  or  dual-dual,  depend  on  an  exact  equality  between 
the  result  values  of  the  various  channels.  To  ensure  this  exact  equality 
of  their  outputs,  the  various  channels  must  all  perform  exactly  the  same 
calculation  on  exactly  the  same  input  values  at  approximately  the  same 
time.  This  exposes  such  systems  to  an  unquantifiable  risk  of  correlated 
faults  generating  errors  simultaneously  in  several  channels.  Such  correlated 
faults  might  result  from  some  external  influence,  such  as  lightning  or  cosmic 
rays,  or  from  accumulation  of  latent  faults  not  within  the  coverage  of  the 
diagnostics,  or  from  design  faults  in  the  hardware  logic  or  the  software. 

A much higher degree of confidence in the resilience of the system to
correlated faults would result from a system design in which each channel
performs its calculation at different times, on different input values, and
obtains different outputs. It is even possible to consider the use of different
algorithms in each of the channels. Unfortunately, as exhibited above,
without an exact match between channels, standard voting techniques are
vulnerable to faults that cause loss of consistency between channels and
thus system failure. We seek here to provide alternative algorithms that
permit differences between channels without risk of loss of consistency.

Figure 5.6: Extrapolation from Past Values to a Most Probable
Current Value

The first thoughts on an approach to such asynchronous error masking
envisage a system of four channels. Each channel operates at the required
iteration rate but completely unsynchronized with the other channels, thus
minimizing interaction between channels. Each result produced would carry
a timestamp. A processor, when voting such a result, would have access
to the four most recent values, one from each channel, together with their
timestamps. From these it would be possible to extrapolate to a most
probable current value, as shown in Figure 6.

More formally, if the i'th broadcast result from processor p contains
a value v_{i,p} and a timestamp t_{i,p}, and if the most recent result so
far received from processor p is the n_p'th, the algorithm can be expressed as:

consensus value = F(v_{n_a,a}, t_{n_a,a}, v_{n_b,b}, t_{n_b,b},
                    v_{n_c,c}, t_{n_c,c}, v_{n_d,d}, t_{n_d,d})

where F is some function to be determined, and a, b, c, d are the four
processors.
Unfortunately, it is easy to show that the timestamps do not assist in the
maintenance of consistency in the absence of any constraints on the times
at which results are calculated. If greater weight is given to more recent
values, those values may be erroneous values, increasing the vulnerability
of the system. In particular, consider the case in which three good values
are reported approximately simultaneously and subsequently an erroneous
value is reported. Any preference given to recent values can only render
the consensus less reliable than that obtained by ignoring the timestamps.
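
The case described here can be made concrete with a small numerical sketch. The text leaves the function F undetermined, so the exponential recency weighting below is purely an assumed stand-in chosen for illustration, as are the function names and the sample values.

```python
def recency_weighted_mean(values, ages, half_life):
    """Weighted mean in which a value's weight halves every half_life units
    of age. This weighting is hypothetical; the text does not specify F."""
    weights = [0.5 ** (age / half_life) for age in ages]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def plain_mean(values):
    """Mean that ignores the timestamps altogether."""
    return sum(values) / len(values)

# Three good values (100.0) reported together, then one erroneous value
# (200.0) reported three time units later, so it is the most recent.
values = [100.0, 100.0, 100.0, 200.0]
ages = [3, 3, 3, 0]

weighted = recency_weighted_mean(values, ages, half_life=1)
unweighted = plain_mean(values)
print(weighted, unweighted)
```

Under these assumptions the recency-weighted result lies much further from the good value 100.0 than the unweighted mean does, which is the vulnerability the paragraph describes.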

Consideration can also be given to the clock synchronization algorithm
described above. Here, if processor a is considering the values generated
by processors b, c, d, with current values v_a, v_b, v_c and v_d,

For i in b, c, d:  v'_i = if v_i > v_a + δ or v_i < v_a − δ
                          then v_a
                          else v_i

and then: consistent result = (v_a + v'_b + v'_c + v'_d) / 4

That algorithm does indeed maintain consistency between channels, but
the rate of convergence is very weak and the drift and error signals that
can be introduced by undetected faulty clocks are much larger than the
permitted drift and jitter of working clocks. In the clock synchronization
application this is not critical, for the individual clocks have performance
characteristics much better than those required for typical system
applications. For a control system application, however, the errors introduced
by a faulty channel can easily overwhelm the control action of the system, and
thus such an algorithm is clearly unacceptable.
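
A minimal Python sketch of the bounded averaging scheme discussed above follows. It assumes that the consistent result is the arithmetic mean of the processor's own value and the three filtered values, as the description of the SIFT clock algorithm suggests; the function name and the numbers are illustrative only.

```python
def bounded_mean(own, others, delta):
    """Mean of own value and the other channels' values, where any value
    further than delta from own is replaced by own (assumed formulation)."""
    filtered = [own if (v > own + delta or v < own - delta) else v
                for v in others]
    return (own + sum(filtered)) / (len(filtered) + 1)

# A wildly erroneous value is rejected and has no effect on the result.
print(bounded_mean(10.0, [10.0, 10.0, 50.0], delta=1.0))

# An erroneous value just inside the bound is accepted and shifts the
# result by up to delta / 4 on every iteration.
print(bounded_mean(10.0, [10.0, 10.0, 10.9], delta=1.0))
```

The second call shows why the scheme is weak for control applications: a faulty channel that stays just inside the bound can bias the result persistently.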

A possible alternative approach requires that the four channels compute
their results at uniform phases within the iteration interval, one channel
generating a value at the start of the interval, a second channel generating
its result a quarter of the interval later, etc., as shown in Figure 7. This
additional information allows the algorithm an improved ability to compute
a most probable current value and to reject erroneous values. The uniform
spacing at which results are generated through the interval greatly simplifies
calculations compared with a system in which such spacings are arbitrary,
and thus assists in reducing the voting calculation overhead.

Figure 5.7: Calculation of Results at Uniform Phases within an Interval

An initial evaluation of such a system used the arithmetic mean of
the four values for the most probable current value, as in the clock
synchronization algorithm. Each channel uses fixed limits for the acceptable
deviation of the values computed by other channels from its own most
recent value, but those limits can differ for each of the other channels. Thus
if δ is an appropriate acceptable deviation for the channel whose result was
computed one quarter of an iteration later, then 1.3δ is an appropriate limit
for the channel computing half an iteration later and 1.2δ for the channel
computing three quarters of an iteration later.

Here, if processor a is considering the values generated by processors
b, c, d, with current values v_a, v_b, v_c and v_d,

v'_b = if v_b > v_a + δ or v_b < v_a − δ
       then v_a
       else v_b

v'_c = if v_c > v_a + 1.3δ or v_c < v_a − 1.3δ
       then v_a
       else v_c

v'_d = if v_d > v_a + 1.2δ or v_d < v_a − 1.2δ
       then v_a
       else v_d

and then: consistent result = (v_a + v'_b + v'_c + v'_d) / 4

Unfortunately,  while  this  algorithm  appears  to  be  better  than  the  basic 
clock  synchronization  algorithm,  it  is  only  slightly  so  and  the  drift  and 
error  signals  introduceable  by  a  fault  are  still  at  least  comparable  to  the 
maximum  permissible  control  action  of  the  system.  Thus  the  algorithm  is 
still  unacceptable. 

We  can  refine  the  algorithm  by  giving  different  weights  to  each  of  the 
values,  for  instance: 


consistent result = [weighted-mean expression not recoverable from the source]

but  the  effect  is  marginal  and  still  far  from  providing  acceptable  margins 
for  control  purposes. 

Error  masking  algorithms  such  as  these  act  as  filters  and,  like  all  filters, 
necessarily  introduce  delay  into  the  control  loop.  The  algorithms  above 
introduce  a  delay  of  about  2/3  of  an  iteration.  To  maintain  the  same 
margins  of  loop  stability,  the  introduction  of  such  a  delay  would  require  an 
increase  in  the  iteration  rate  of  about  33%. 

A  number  of  possible  improvements  to  the  algorithm  are  under  consider¬ 
ation.  We  are  currently  working  on  algorithms  that  make  better  use  of  the 
relative  timing  of  results,  both  by  giving  greater  weight  to  more  recent  re¬ 
sults  in  estimating  the  most  probable  current  value,  and  also  by  considering 
the  values  generated  by  other  channels  when  determining  the  acceptability 
of  a  result.  A  further  possibility  is  the  use  of  a  five  channel  system  fully 
capable  of  rejecting  the  most  malicious  faults  which  degrades  on  the  first 
reconfiguration  to  a  four  channel  system  capable  of  rejecting  all  faults  ex¬ 
cept  those  malicious  faults  in  which  different  information  is  delivered  to 
different  destinations  by  the  broadcast  mechanisms.  Since  the  probability 
of  a  second  fault  during  a  mission  is  low,  and  the  probability  of  a  malicious 
fault  is  also  low,  such  a  system  might  be  judged  to  be  adequately  reliable. 
Appendix

Proof  of  Proposition  1 

It follows from (1) that, for any operation execution $A$ in $S$, the relations
$\rightarrow$ and $\dashrightarrow$ are not changed by either of the following two changes to the
global-time model, where $\delta > 0$:

1. Changing $s_A$ to $s_A - \delta$ if, for all $B \in S$: $f_B < s_A$ implies $f_B < s_A - \delta$.

2. Changing $f_A$ to $f_A + \delta$ if, for all $B \in S$: $f_A < s_B$ implies $f_A + \delta < s_B$.

Let $T$ denote the set of numbers $s_A$ and $f_A$ for all $A$ in $S$, and for any real
$t$, let $S(t) = \{r \in T : r < t\}$ and $F(t) = \{r \in T : r > t\}$. M2 implies that
for any $t$, $\max S(t) < t$ and $t < \min F(t)$.

For any $A$, if $s_A$ equals $s_B$ or $f_B$ for some $B \neq A$, I can change $s_A$ to
$s_A - \delta$, where $0 < \delta < \epsilon$ is chosen so that $s_A - \delta > \max S(s_A)$. Similarly,
if $f_A$ equals $s_B$ or $f_B$ for some $B \neq A$, I can change $f_A$ to $f_A + \delta$, where
$0 < \delta < \epsilon$ and $f_A + \delta < \min F(f_A)$.

The details of the formal proof, which involves an inductive definition
of $s'$ and $f'$ based upon the countability of $S$, are left to the reader.
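
The first of the two changes can be illustrated on a small finite example.
The sketch below assumes that relation (1), which is stated outside this
appendix, has the usual interval form: $A \rightarrow B$ when $f_A < s_B$, and
$A \dashrightarrow B$ when $s_A \le f_B$. The function names and the dictionary
representation of a global-time model are hypothetical.

```python
def relations(model):
    """model maps an operation name to (start, finish) with start < finish.
    Returns the precedes and can-affect relations as sets of pairs
    (assumed form of relation (1))."""
    precedes = {(a, b) for a in model for b in model
                if model[a][1] < model[b][0]}
    can_affect = {(a, b) for a in model for b in model
                  if model[a][0] <= model[b][1]}
    return precedes, can_affect


def move_start_earlier(model, name, delta):
    """Change 1: replace s_A by s_A - delta, provided no finish time
    lies in the half-open interval [s_A - delta, s_A)."""
    s, f = model[name]
    if any(s - delta <= fb < s for (_, fb) in model.values()):
        raise ValueError("side condition of change 1 is violated")
    changed = dict(model)
    changed[name] = (s - delta, f)
    return changed


def is_nondegenerate(model):
    """True when all start and finish times are distinct."""
    points = [t for interval in model.values() for t in interval]
    return len(points) == len(set(points))


# f_A coincides with s_B, so the model is degenerate.
model = {"A": (0.0, 2.0), "B": (2.0, 4.0)}
fixed = move_start_earlier(model, "B", 0.5)
```

Here `fixed` is nondegenerate and defines the same two relations as `model`,
which is what the proposition requires.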

Proof  of  Propositions  2  and  3 

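The “only if” part of Proposition 2 follows immediately from (1). To prove
Proposition 3 and the “if” part of Proposition 2, I prove that for every
system execution $\langle S, \rightarrow, \dashrightarrow \rangle$ there exists a global-time model $s, f$ such that
for every $A, B \in S$:

• $A \rightarrow B$ implies $f_A < s_B$

• $A \dashrightarrow B$ implies $s_A < f_B$

The relations $\rightarrow'$ and $\dashrightarrow'$ defined by this global-time model satisfy the
requirements of Proposition 3. Moreover, if $\langle S, \rightarrow, \dashrightarrow \rangle$ satisfies A#, then
$\rightarrow'$ and $\dashrightarrow'$ must equal $\rightarrow$ and $\dashrightarrow$, since if A# holds then
$A \not\dashrightarrow B$ implies $B \rightarrow A$, which implies $B \rightarrow' A$, so $A \not\dashrightarrow' B$,
and $A \not\rightarrow B$ implies $B \dashrightarrow A$, which implies $B \dashrightarrow' A$, so $A \not\rightarrow' B$.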
The following proposition is used in this proof and in a later one.

Proposition 10 Let $T$ be the set consisting of all elements of the form $s_A$
and $f_A$ for $A \in S$ (the elements of $T$ are uninterpreted symbols, not
necessarily real numbers), and let $\prec$ be the smallest transitively closed relation
such that

• If $A \rightarrow B$ then $f_A \prec s_B$.

• If $A \dashrightarrow B$ or $A = B$ then $s_A \prec f_B$.

Then $\prec$ is an irreflexive partial ordering.

Proof: Define the relations $\rightarrow_1$, $\rightarrow_2$ and $\rightarrow_3$ on $T$ as follows:

• For all $A$: $s_A \rightarrow_1 f_A$.

• $f_A \rightarrow_2 s_B$ if and only if $A \rightarrow B$.

• $s_A \rightarrow_3 f_B$ if and only if $A \dashrightarrow B$.

Let $\rightarrow$ (on $T$) be the union of the three relations $\rightarrow_1$, $\rightarrow_2$ and $\rightarrow_3$, so $\prec$ is the
transitive closure of $\rightarrow$. It suffices to prove that $\rightarrow$ is an acyclic relation.

The proof is by contradiction. Choose a shortest cycle formed by the
$\rightarrow$ relation. A cycle composed entirely of $\rightarrow_1$ and $\rightarrow_2$ relations would
violate A1, so the cycle must contain a portion of the form:

$f_A \rightarrow_2 s_B \rightarrow_3 f_C \rightarrow_2 s_D$

since $\rightarrow_2$ is the only relation from an $f$ to an $s$ and there are no $s$ to $s$ or $f$ to
$f$ relations. I can apply A4 to deduce that $f_A \rightarrow_2 s_D$, which contradicts our
assumption that the cycle had minimal length, proving Proposition 10. □

Returning to the proof of Propositions 2 and 3, we see that $\prec$ is an
irreflexive acyclic relation. Moreover, A5 implies that for any $t \in T$, $t \prec s$
for all but a finite number of elements $s$. This, together with the
countability of $T$, implies that $\prec$ can be completed to a total ordering $<$ such that
there is an order-preserving isomorphism of $T$ with a subset of the
natural numbers. Identifying the elements of $T$ with the corresponding natural
numbers provides the desired global-time model.
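
For a finite set of operation executions, the construction in this proof
reduces to a topological sort of the ordering on start and finish symbols. The
sketch below is an illustration only: the function name and the encoding of
symbols as `("s", A)` and `("f", A)` pairs are hypothetical, and it relies on
the standard-library `graphlib` module (Python 3.9 or later), which raises
`CycleError` if the constraints are cyclic.

```python
from graphlib import TopologicalSorter


def global_time_model(ops, precedes, can_affect):
    """Assign distinct natural numbers to the start and finish of each
    operation so that A -> B gives f_A < s_B and A --> B gives s_A < f_B."""
    # Map each symbol to the set of symbols that must come before it.
    before = {("s", a): set() for a in ops}
    before.update({("f", a): {("s", a)} for a in ops})  # s_A before f_A
    for a, b in precedes:
        before[("s", b)].add(("f", a))                  # f_A before s_B
    for a, b in can_affect:
        before[("f", b)].add(("s", a))                  # s_A before f_B
    order = TopologicalSorter(before).static_order()
    time = {symbol: i for i, symbol in enumerate(order)}
    return {a: (time[("s", a)], time[("f", a)]) for a in ops}


model = global_time_model(
    ["A", "B", "C"],
    precedes={("A", "B")},
    can_affect={("A", "B"), ("B", "C")},
)
```

Because every symbol receives a different integer, the resulting model is
nondegenerate by construction.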
Proof of Proposition 4

Let  T  be  the  set  of  all  numbers  and  for  4  G  5,  and  let  -<  be  the  partial 
ordering  on  T  defined  as  in  Proposition  10  for  the  precedence  relations  — ^ 
and  -  -  namely,  the  smallest  partial  order  such  that  A  — ^  B  implies 
/a  ■<  sb,  and  A--^  B  or  A  =  B  implies  <  fe-  Observ'e  that  the 
following  hold  for  all  A  and  B  \n  S: 

(a) Either $s_A \prec f_B$ or $f_B \prec s_A$ (by A#).

(b) $f_A < s_B$ implies $f_A \prec s_B$ (by H3).

To  prove  the  proposition,  it  suffices  to  construct  s',  f  such  that®  s  <  s'  < 
f'<f  and  for  all  A  and  B.  /a  <  sb  implies  <  s'g  and  s^  <  /b  implies 
<  Is- 

Let $s', f'$ be any global-time model satisfying

$f'_A < s'_B$ implies $f_A \prec s_B$    (5)

The pair of operation executions $A, B$ is said to be out of order for $s', f'$
if $f_A \prec s_B$ and $s'_B < f'_A$. It follows from (a) and (b) that if there are no
out-of-order pairs, then $s', f'$ satisfies the conditions of the proposition.

I will construct $s', f'$ inductively by constructing a sequence of
nondegenerate models $s^i, f^i$ with $s^i \le s^{i+1} < f^{i+1} \le f^i$, having $s^0, f^0$ equal to
$s, f$ and $s', f'$ equal to their limit. This is done by first choosing the
enumeration of all out-of-order pairs of $s, f$ such that, for any subset of them,
the minimal element is the one $A, B$ having the smallest value of $f_A$ and,
among all such pairs $A, B'$, the one having the largest value of $s_B$. It follows
from M2 that such a minimal element exists for any nonempty set, so this
defines an enumeration of the out-of-order pairs of $s, f$.

If  A,B  is  the  i'*"  out-of-order  pair,  then  s’,/'  will  be  defined  to  be  the 
same  as  s'~‘,/’“'  except  that  s^”'  <  /\  <  s’^  <  This  implies  that 

the  set  of  out-of-order  pairs  for  s',  /'  equals  the  set  of  out-of-order  pair  for 
s‘~‘,/’“‘  minus  the  pair  A,B.  Moreover,  it  follows  from  Ao  and  (b)  that 
any  operation  execution  belongs  to  only  a  finite  number  of  out-of-order 

pairs of $s, f$, so the limit $s', f'$ of the models $s^i, f^i$ exists, satisfies (5), and
has no out-of-order pairs, proving the proposition.

(Footnote: I employ the usual notation that for functions $f$ and $g$ with the same
domain, $f \le g$ if and only if $f(x) \le g(x)$ for all $x$ in their domain.)

For  notational  convenience,  the  construction  of  s',/’  from  s'~^/'■*  is 
given  for  the  case  i  =  0.  So,  I  assume  that  s,/  satisfies  (b),  which  is  the 
same  as  (5),  and  has  a  minimal  out-of-order  pair  A,B.  I  construct  s’,/' 
by  decreasing  /a  and  increasing  to  get  f\  <  without  creating  any 
new  out-of-order  pairs.  (The  construction  for  any  i  is  the  same  except  with 
more  superscripts.) 

Let  A'  be  the  operation  execution  with  the  largest  value  of  sx  such  that 
sx  ■<  //»;  if  there  is  no  such  X,  let  ax  =  —00.  It  follows  from  (b)  and  the 
nondegeneracy  of  s,f  that  sx  <  /a-  Observe  that  there  is  no  C  with  sc 
in  the  interval  (max(s,v, 5b), /aL  since,  by  choice  of  sx,  this  would  imply 
/a  -<  which  would  contradict  the  ma.ximality  of  sq.  Therefore,  if  I 
define  f\  to  be  max(ax, ^s)'*’,  then  s, /'  satisfies  (5)  and  has  the  same  set 
of  out-of-order  pairs  as  s,/,  where  t'*‘  denotes  a  value  larger  than  t  such 
that  there  is  no  value  sc  or  fc  in  the  interval  (Lt'*’]. 

If  Sfl  >  Sx,  so  f\  —  3g,  then  I  can  define  to  be  (/\)'^  and  it  is  clear 
that  /'  also  satisfies  (5)  and  has  the  same  set  of  out-of-order  points  as 
s,  /'  except  that  .4,  B  is  not  out  of  order  for  s',  /',  so  we  are  done. 

Therefore,  I  need  only  consider  the  case  sg  <  sx-  (Since  sx  <  /a,  we 
must  have  sg  ^  s.y  )  I  claim  that  there  is  no  fc  in  the  interval  [sb,  5x]-  If 
there  were,  then  (a)  and  (b)  imply  that  fc  <  sx  and  sg  ■<  fc,  which,  since 
5X  ^  /.A  ,  would  imply  sg  ■<  fA,  contrary  to  the  assumption  that  A,  B  is 
out  of  order  for  a,  /.  Therefore,  defining  5  ®  to  be  the  same  as  s  except  with 
s'^  =  s^,  we  see  that  s"®,  /'  satisfies  (5)  and  has  the  same  set  of  out-of-order 
pairs  as  s,/'.  Replacing  a  by  s  ®  and  starting  our  argument  again,  we  are 
in  the  rase  <  Sg  that  was  considered  above.  This  completes  the  proof. 

Proof  of  Proposition  5 

If  — ►  and  -  --  are  any  relations  on  a  set  S,  let  the  completion  of  — ►  and 
— ►be  the  relations  — ^  and  ---►,  where  — ^  is  the  smallest  transitively 
closed  extension  of  — ►  such  that  ^4  — ^  B  C  — ^  D  implies  .4  — ^  D, 
and  ---is  the  union  of  — >  and  — Thus,  .4  — ^  B  if  and  only  if  there 
exists a chain

A- Ai=>  An  =  B 

where  =>  denotes  either  — ►  or  — ►  C  — ^  D  — ►  for  some  C  and  D. 

Proposition  11  If  — ^  satisfies  A5;  is  the  completion  of  — ► 

and  is  acyclic;  then  is  a  system  execution. 

Proof  :  I  must  show  that  satisfies  A1-A5.  The  only  nonob- 

vious  part  is,  in  the  proof  of  A2,  showing  that  if  A  B  then  B  -/-  A. 
However,  as  observed  above,  this  follows  from  A1  and  A4.  B 

To  prove  Proposition  5,  let  — ^  be  the  union  of  the  relations  — ►  and 
and  let  -  - -*  be  the  union  of  -  --  and  the  restriction  of  to  T.  Note 
that  the  restriction  of  to  M  equals  (by  H3).  I  define  — -  -*  to  be 
the  completion  of  — 

I  claim  that  to  prove  Proposition  5,  it  suffices  to  show  that  — i-  is  acyclic 
and  the  restrictions  of  and  to  )l  equal  and  Proposition  11 
then  implies  that  U  U  T is  a  system  execution,  which  is  easily  seeji 
to  be  implemented  by  5  U  T, — (The  definition  of  ^  and  — - 
implies  that  their  restrictions  to  T  are  extensions  of  and 

Moreover,  I  claim  that  it  suffices  to  prove  that  the  restriction  of  — ►  to 
)f  equals  It  follows  immediately  from  the  definition  of  and  A2  that 
if  the  restriction  of  ^  equals  then  the  restriction  of  — to  M  must 
equal  -  -  Furthermore,  the  definition  of  the  completion  and  the  acyclicity 
of  -Lf  imply  that  any  cycle  of  relations  must  include  an  element  of  )I, 
so  A  A  must  hold  for  some  A€  )I.  If  the  restriction  of  — ►  to  ^  equals 
then  the  acyclicity  of  follows  from  the  acyclicity  of  Thus,  it 
suffices  to  prove  that  if  A  B  then  A  — ►  B. 

By  definition  of  if  A  B  then  there  exists  a  chain  A  =  Ai  => 
...  =>  An  —  B,  where  =>  denotes  either  or  C  D  — ►. 
Note  that  if  A,-  and  A,+i  are  both  in  M,  then  A,-  =>  A,+i  implies  that 
Ai  A,+i,  and  if  they  are  both  in  T  then  A,-  ==>  A,+i  implies  that 
Ai  A,+i.  Therefore,  it  suffices  to  show  that  any  such  chain  that  is  of 
minimal  length  has  length  one. 
If three consecutive elements $A_i$, $A_{i+1}$, and $A_{i+2}$ in this chain are either
all  in  If  or  all  in  T,  by  the  transitivity  of  and  — ^  it  follows  that 
A,-  =5>-  A,+2-  Therefore,  in  a  minimal-length  chain,  A,-  must  be  in  1/  if  i  is 
odd  and  in  T  if  i  is  even.  If  n  >  0,  then  we  have  Aj  =>  A2  Aj,  with 
Ai  and  As  in  M  and  Ao  in  T.  A  relation  between  an  element  of  If  and 
an  element  of  T  must  be  a  — ^  relation.  Considering  the  two  possible  cases 
for  each  ==>  relation,  using  Al  and  A4  for  the  relations  and  it 
follows  from  Ai  A2  ==>  As  that  Ax  A2  As,  so  Ax  As-  This 
contradicts  the  assumption  of  the  minimality  of  n,  proving  that  n  =  1  and 
A  B,  which  completes  the  proof  of  the  proposition. 

Proof  of  Propositions  6  and  7 

Parts  (a)  and  (b)  of  Proposition  6  are  immediate  consequence  of  Defini¬ 
tion  4.  To  prove  part  (c),  observe  that  this  definition  implies  ---  yh'-'l. 
The  result  is  immediate  if  j  =  0.  If  j  >  0,  then  Vb~d  — ►  V'b’l.  Combining 
these  two  relations  with  the  hypothesis,  we  have 

_ _  _ _ 

Axiom  A4  implies  that  Vb"!!  — ►  which,  by  A2,  implies 

This finishes the proof of Proposition 6.

To  prove  part  (a)  of  Proposition  7,  observe  that  it  follows  immediately 
from  Definition  4  that  Pl**  ---*/?  implies  k  <  Conversely,  I  assume  k  <  j 
and  show  this  implies  — >  R.  Since  Pb'I  the  desired  conculsion 

is  immediate  if  fc  =  j.  If  k  <  j,  then  and  it  follows  from  .43. 

For  part  (b).  Definition  7  implies  that  if  i  <  k'  then  R  — >  Letting 

k'  =  k  +  I,  this  shows  that  if  i  <  I:  then  /2  -  -  -  Conversely,  suppose 

R  — >  Then  k+ 1  ^  i.  If  k+ 1  <  t,  then  pl*+d  — ►  V'1‘1,  so  A3  would 

imply  Pf*'  contrary  to  Definition  4.  Hence,  we  must  have  i  <  k  +  I 

so $i \le k$, completing the proof of Proposition 7.

Proof  of  Propositions  8  and  9 

Apply  Proposition  3  to  extend  the  given  — ‘  and  — >  relations  so  they 
satisfy  A#.  It  follows  from  Bl  that  this  extension  does  not  add  any  new 
precedence relations between reads and writes. A read sees as defined
by  these  new  relations,  if  and  only  if  it  sees  in  the  original  system 
execution.  Hence,  the  new  system  execution,  which  satisfies  A#,  satisfies 
the  hypotheses  of  the  appropriate  proposition.  Applying  Proposition  2,  I 
can  therefore  assume  a  nondegenerate  global-time  model  for  the  system 
execution. 

For  the  proof  of  Proposition  9,  let  <f>  be  the  assumed  function.  For  the 
proof  of  Proposition  8,  0  is  defined  as  follows.  If  is  a  read  that  sees 
for  a  safe  register  define  0(i2)  to  equal  j,  and  for  a  regular  register 
define  it  to  be  a  value  satisfying  conditions  1  and  2  in  the  hypothesis  of 
Proposition  9.  (B4  implies  that  such  a  definition  is  possible.) 

I  first  show  that  5, — (which  I  am  assuming  to  have  a  nondegen¬ 
erate  global-time  model)  trivially  implements  a  system  execution  in  which 
reads  are  instantaneous,  which  is  all  that  is  required  to  prove  Proposition  8. 
Given  the  nondegenerate  global-time  model  s,  /  for  S, — it  suffices 
to  find  a  global-time  model  s',f'  with  s  <  s'  <  f  <  f  in  which  all  reads 
are  instantaneous,  such  that  B1-B4  hold  for  the  system  execution  defined 
by  s',  /'. 

For  notational  convenience,  let  s,-  and  /,•  denote  svio  and  /vm,  respec¬ 
tively.  Let  s',  /'  be  the  same  as  s,  /  except  that,  for  a  read  R,  define  s'^  to 
equal  the  maximum  of  the  following  three  quantities: 

•  Sr 

•  (5^(n))'*’ 

•  max{sn'  :  <i>(R')  <  <t>(R)  and  sr*  <  //?}■*■ 

and define $f'_R$ to equal $(s'_R)^+$. When the appropriate careful definition of
$^+$ is given, this results in a nondegenerate global-time model in which every
read is instantaneous. I must check that, for any read $R$: $s_R \le s'_R < f'_R \le f_R$,
B1-B3 remain satisfied, and B4 remains satisfied when $v$ is regular.

It  is  immediate  by  the  definition  of  s'r  that  sr  <  s'r.  Since  Jr  =  (s'r)"^, 
to  establish  the  remaining  inequalities,  I  need  to  show  that  /r  <  Jr.  UR 
sees  then,  by  Definition  4,  Sj  <  /r  (the  strict  inequality  comes  from 
nondegeneracy),  and,  since  <f>(R)  <  j,  S4^R)  <  Jr.  The  required  inequality 
now  follows  easily  from  the  definition  of 
I must now show that B1-B3 and, if $v$ is regular, B4 hold for the new
precedence relations. B1 and B2 are trivial. For B3 and B4, consider what
a read sees in the new system execution if it sees in the original one.
There are three cases:

1.  If  ^(/j)  <  Sft  then 

(a)  if  sn  <  s^(/i)+i  then  R  sees 

(b)  if  e0{/i)+i  <  Sn  then  R  sees 

2.  If  Sr  <  /^(R)  then  R  sees 

Moreover,  it  is  immediate  from  Definition  4  that  case  1(b)  is  impossible 
if  (i>{R)  =  j,  which  is  the  case  when  v  is  assumed  to  be  only  safe.  This 
definition  also  implies  that  fj  <  sr  if  and  only  if  i  =  j.  Thus,  when  v  is 
only  safe,  R  sees  in  the  new  system  execution  if  and  only  if  it  does 
in  the  old,  proving  B3.  For  the  case  when  v  is  regular,  B3  and  B4  follow 
immediately  from  the  fact  that  R  returns  the  value  This  finishes 

the  proof  of  Proposition  8. 

To  complete  the  proof  of  Proposition  9,  I  first  show  that  '\f  (p(R)  <  <p(S) 
for  reads  R  and  5,  then  <  s'^-  The  third  hypothesis  about  <j>  implies  that 
if  <t>{R)  <  0(5),  then  sr  <  fs-  By  the  definition  of  s'g,  this  implies  that  s'g 
is  greater  than  each  of  the  three  quantities  of  which  is  the  maximum,  so 
s'n  <  s's-  Since  reads  are  instantaneous  with  respect  to  s',/',  this  implies 

fk  < 

I  must  construct  a  new  global-time  model  in  which  writes  are 

also  instantaneous  and  B1-B3  are  still  satisfied,  so  that  s",/"  is  the  same 
as  s',  /'  except  for  writes,  and  for  any  write  s'*  <  s*  <  /*  <  /*.  (Note 
that  B5  follows  from  the  fact  that  reads  and  writes  are  instantaneous,  and 
B4  follows  from  B3  and  B5.) 

Let  s'*  be  the  maximum  of  the  two  quantities  s*  and  max{/)j  :  6(R)  = 
k  —  1}"^,  and  let  /*'  be  (s'*)"^.  Since  is  one  of  the  values  “seen”  by 

R  in  the  system  execution  defined  by  s',/',  if  4>(R)  =  k  —  I  then  s^  <  /*, 
which  implies  that  s'*  <  /*.  We  therefore  have  s'  <  s"  <  /"  <  /',  and  reads 
and  writes  are  both  instantaneous  in  s",/".  Again,  Bl  and  B2  are  trivial, 
so  I  need  only  prove  B3. 
Since reads and writes are instantaneous, B5 holds: a read $R$ sees
I  must  show  that  i  =  The  definition  of  s"  implies  that  fR=fR< 

^  niust  therefore  show  that  In  the  global-time  model 

s',/',  the  read  R  “sees  the  value”  so  s'^^yjj  <  s'^.  By  definition  of  s", 

we  can  have  only  if  there  exists  some  R'  with  0(i2')  <  </)(/?)  and 

/r'  >  Sfi-  However,  I  showed  above  that  R'  <  R  implies  /{j,  <  s'^,  which 
completes  the  proof. 
Part VI

Experimental  Implementation 
and  Evaluation  of  the  Trans 
Broadcast  Protocol 
6.1 Introduction


An earlier section of this report (Part III) introduced a novel link-level
protocol for broadcast environments. The protocol, known as Trans, exploits
the characteristics of broadcast communications media in order to achieve
reliable communication with minimal overhead.

This section of the report describes a prototype implementation of
Trans, which was undertaken so that the design and performance of the
protocol could be evaluated. A great deal was learned about the behavior
of the protocol during this process, including subtle problems in its design.
However, this experimental implementation was undertaken towards the
very end of the project, when time and funds were almost exhausted, and
we were therefore unable to resolve some difficulties in the design of
Trans completely, or to collect as much data on its performance as we would
have wished.

We believe that the subtle problems and difficulties encountered in the
implementation of Trans vindicate the decision to undertake that
implementation. Protocols are notoriously difficult to get right, and claims based
on only informal specifications and correctness arguments (as was the case
with the previous description of Trans) should be viewed with skepticism.
The problems discovered in Trans do not appear major and we believe
they can be corrected. Unfortunately, there simply was not enough time
to address them during this contract. The performance measurements that
we were able to make are encouraging and suggest that broadcast protocols
such as Trans offer useful benefits in certain situations.

This experimental implementation and evaluation of Trans has
suggested several directions for future research. An increased understanding
of the protocol has indicated several modifications that would lead to
improved performance. It has become clear that the protocol should be
specified and proved formally, and that implementation considerations must be
addressed. Additional performance measurements are required to complete
the evaluation of the protocol, and comparisons should be made against
alternative approaches. There are also several extensions to the protocol that
can be examined. Finally, the protocol can be used as the basis for the
design, implementation, and evaluation of a variety of distributed systems
algorithms.
6.2 Specification of the Trans Protocol

The starting point for our implementation of Trans is the description of
the protocol given in Part III of this report. However, rather than simply
reproducing that description here, it will be useful to first provide some
additional motivation and discussion.

The context in which Trans is to operate assumes a communications
medium using physical broadcast (such as Ethernet or packet radio), and
an applications environment that requires reliable broadcast
communications. Conventional protocols that assume point-to-point communications
could require a minimum of 2 × (n − 1) messages to transmit a message from
one of n hosts to all the others (composed of n − 1 individual transmissions
from the sender to each recipient, and the same number of
acknowledgments). A protocol that allows broadcast transmission but that requires
individual acknowledgments could reduce this to n messages (1 transmission
and n − 1 acknowledgments).

If  we  are  prepared  to  wait  for  acknowledgments  until  receiving  hosts 
have  messages  of  their  own  to  transmit,  then  no  additional  messages  may 
be  required  beyond  the  initial  broadcast:  receiving  hosts  simply  save  up 
acknowledgments  and  append  them  to  their  own  messages.  Assuming  a 
community  of  n  hosts  all  broadcasting  at  approximately  the  same  rate,  this 
could  require  each  host  to  append  an  average  of  n  —  1  acknowledgments  to 
each  of  its  own  messages. 
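
The message counts quoted in the two preceding paragraphs can be tabulated
with a few lines of arithmetic. The function names below are hypothetical and
the sketch only restates the figures given in the text.

```python
def point_to_point_messages(n):
    """n - 1 individual transmissions plus n - 1 acknowledgments."""
    return 2 * (n - 1)


def broadcast_with_individual_acks(n):
    """One broadcast transmission plus n - 1 separate acknowledgments."""
    return 1 + (n - 1)


def piggybacked_acks_per_message(n):
    """Average acknowledgments appended to each message when every host
    broadcasts at about the same rate and acknowledgments are saved up."""
    return n - 1


for n in (4, 10, 50):
    print(n, point_to_point_messages(n),
          broadcast_with_individual_acks(n),
          piggybacked_acks_per_message(n))
```

With piggybacking no extra messages are sent, but the acknowledgment list
carried by each message still grows linearly with the number of hosts.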

The novel contribution of Trans is that it attempts to reduce the
number of acknowledgments that must be appended to each message by
exploiting the broadcast character of the communications medium and the
transitivity of acknowledgments.¹ If a host needing to acknowledge a
message Y sees another message X carrying an acknowledgment for Y, then it
need not acknowledge Y explicitly: its acknowledgment of X will
implicitly acknowledge receipt of Y. Under favorable circumstances, this could
result in each host having to explicitly acknowledge only 1 message in each
of its own messages, the remaining n − 2 being acknowledged implicitly.
This can significantly reduce the bandwidth needed for a given degree of
communication. In addition, it can significantly reduce the amount and
frequency of communications required from individual hosts. This could
be beneficial in packet radio situations, for example, where certain stations
are attempting to operate under near radio-silence.

¹The name of the protocol is derived from TRANSitivity.
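
The transitivity argument can be illustrated by following chains of positive
acknowledgments. The sketch below deliberately ignores negative
acknowledgments and message loss; the function name and the dictionary
representation (message name mapped to the messages it explicitly
acknowledges) are hypothetical.

```python
def implicitly_acknowledged(acks, message):
    """Return every message reachable from `message` by following
    explicit positive acknowledgments (direct ones included)."""
    reached = set()
    stack = list(acks.get(message, ()))
    while stack:
        current = stack.pop()
        if current not in reached:
            reached.add(current)
            stack.extend(acks.get(current, ()))
    return reached


# W explicitly acknowledges only X, X acknowledges Y, Y acknowledges Z.
acks = {"W": ["X"], "X": ["Y"], "Y": ["Z"], "Z": []}
covered = implicitly_acknowledged(acks, "W")
```

In this example W carries a single explicit acknowledgment, yet its sender
has acknowledged X, Y and Z.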

The  naive  protocol  outlined  above  must  obviously  be  modified  to  deal 
with the circumstance where a host fails to receive a message. Accordingly,
negative  acknowledgments  are  introduced  so  that  hosts  can  indicate  such 
failures.  Henceforth,  a  (positive)  acknowledgment  will  be  referred  to  as  an 
ack,  while  a  negative  acknowledgment  is  called  a  nack.  (Machines  will  be 
referred  to  as  hosts,  although  the  earlier  section  on  TRANS  refers  to  them 
as  nodes.)  A  host  should  append  a  nack  to  its  next  message  if  it  receives 
a  message  in  a  corrupted  state  (but  is  able  to  recover  the  identity  of  the 
message),  or  learns — through  the  presence  of  acks  on  other  messages — of 
the  existence  of  a  message  that  it  has  not  received.  Such  nacks  provoke 
the  sender  of  the  message  concerned  to  retransmit  it.  A  host  that  has  a 
pending  nack  can  discard  it  if  it  sees  another  message  carrying  a  nack  for 
the  same  message,  since  that  prior  nack  will  already  be  sufficient  to  provoke 
the  retransmission  that  is  desired. 

Although  the  incorporation  of  nacks  into  the  protocol  may  seem  a  small 
change,  concerned  solely  with  liveness,  it  turns  out  to  greatly  complicate 
the  “reception  analysis”  component  of  the  protocol. 

This problem can be seen in the example shown in Figure 6.1. In this
and subsequent similar figures, the named nodes (X, Y, Z, etc.) represent
messages, and the arcs between them represent (some of the) acks and
nacks carried by the message at the bottom of the arc. The time dimension
runs down the page, so the example in Figure 6.1 indicates that messages
Y and X were sent sometime later than Z, and that Y nacked Z while
X acked it. Message W carried acks for both X and Y.

            Z
    nack  /   \  ack
         /     \
        Y       X
     ack \     / ack
          \   /
            W

Figure 6.1: Difficulty Introduced by nacks

The question is: can we deduce whether or not the host that sent W saw
message Z? The answer is that it is very difficult to make such deductions
in the presence of nacks. Suppose the sender of W did see Z, and that it
then saw X. Since X carries an ack for Z, the sender of W will discard its
own ack for Z and acknowledge it implicitly in its ack for X. On the other
hand, if we assume that the sender of W did not see Z, then exactly the same
argument applies mutatis mutandis with respect to Y and nacks. It might
seem that this ambiguity could be resolved if the sender of W were not so
hasty to discard its own pending ack for Z: then it could explicitly ack Z
once it saw the nack carried by Y. A little thought will show that this
stratagem cannot be relied upon. Consider the situation pictured in
Figure 6.2.

Figure 6.2: Further Difficulty Introduced by nacks

Here, the host that sent W (which is assumed to have seen Z) might have been
prepared to directly ack Z had it known of the ambiguity introduced by
the nack carried by X, but it may not itself have seen X (the nack carried
by Y will have caused it to discard its own nack for X) and may therefore
be unaware of the nack which X carries for Z.

These  examples  show  that  a  nack  introduces  uncertainty  as  to  whether 
any  messages  further  along  an  ack  chain  have  been  seen  or  not.  Thus  there 
is  little  point  in  retaining  acks  for  messages  that  others  have  nacked — and 
so  the  Trans  protocol  discards  both  pending  nacks  and  acks  whenever  a 
nack is seen for the message concerned.

The previous discussion should have motivated the essential components
of the Trans protocol, whose description from Part III of this report is
repeated below:

•  Each  message  is  broadcast  with  a  header  in  which  there  is  a  mes¬ 
sage  identifier  containing  the  source  of  the  message  and  a  message 
sequence  number.  A  version  number  is  also  included  in  the  identifier 
to  distinguish  retransmissions.  Sequence  numbers  can  recycle  over 
some  suitably  long  interval.  Each  message  also  carries  with  it  ac¬ 
knowledgments  (positive  and  negative)  to  previous  messages,  and  an 
error  detecting  code.  Other  fields  in  the  header,  such  as  a  message 
destination  list  (for  multicast),  may  be  present  but  do  not  play  any 
part  in  this  protocol. 

•  Each  node  maintains  a  list  of  positive  and  negative  acknowledgment 
message  identifiers.  Whenever  it  broadcasts  a  message,  it  appends 
this  list  of  acknowledgments  to  the  message,  and  then  clears  its  list. 

•  When  a  node  receives  a  message  it  has  not  previously  received  in 
an  uncorrupted  state,  it  adds  the  identifier  as  an  acknowledgment  to 
its  list.  If  the  message  is  uncorrupted,  the  identifier  is  added  as  a 
positive  acknowledgment;  if  the  message  is  corrupted,  but  with  an 
uncorrupted  header,  the  identifier  is  added  as  a  negative  acknowledg¬ 
ment. 

• When a node sees a positive acknowledgment appended to a message
that  it  receives,  it  deletes  from  its  own  list  any  positive  acknowledg¬ 
ment  for  that  message.  When  it  sees  a  negative  acknowledgment  for  a 
message,  it  deletes  from  its  list  any  acknowledgment  for  that  message, 
whether  positive  or  negative. 

•  When  a  node  sees  a  positive  acknowledgment  for  a  message  that  it 
has  not  received,  it  adds  a  negative  acknowledgment  to  its  list. 

• If a node has no messages pending, it may be necessary to construct
a  null  message  to  carry  acknowledgment  messages.  The  acceptable 
delay  before  transmitting  a  null  message  may  differ  for  positive  and 
negative  acknowledgments. 


•  When  a  node  receives  a  negative  acknowledgment  for  one  of  its  mes¬ 
sages,  or  has  received  no  positive  acknowledgment  within  some  time 
interval,  it  retransmits  the  message.  The  retransmission  must  be 
identical  to  the  prior  transmission,  and  thus  must  carry  with  it  ex¬ 
actly  the  same  acknowledgments,  positive  and  negative,  carried  by 
the  prior  transmission  of  that  message. 
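
The list-maintenance rules above can be sketched in a few lines of Python. This is an illustration only, not the code of the implementation described later (which was written in C). The `Node` class, its method names, and the `(source, sequence, version)` tuple layout are assumptions made for the example. The sketch also assumes that a transmission with a corrupted body contributes only a nack and that its carried acknowledgments are not examined, which the rules do not state explicitly.

```python
def drop(pending, key):
    """Return `pending` without entries for any version of message `key`."""
    return {m for m in pending if m[:2] != key}


class Node:
    """Hypothetical per-node state: pending acks and nacks."""

    def __init__(self, name):
        self.name = name
        self.received = set()  # (source, seq) of messages held uncorrupted
        self.acks = set()      # pending positive acknowledgments
        self.nacks = set()     # pending negative acknowledgments

    def broadcast(self, seq, version=0):
        """Build a new transmission carrying the pending lists, then clear them."""
        msg = ((self.name, seq, version),
               frozenset(self.acks), frozenset(self.nacks))
        self.acks.clear()
        self.nacks.clear()
        self.received.add((self.name, seq))  # a node holds its own messages
        return msg

    def on_receive(self, msg_id, carried_acks, carried_nacks, body_ok=True):
        """Apply the rules to a transmission whose header arrived intact."""
        key = msg_id[:2]
        if key in self.received:
            return  # already held uncorrupted
        if not body_ok:
            # Assumption: corrupted body, readable header -> nack only.
            if all(m[:2] != key for m in self.nacks):
                self.nacks.add(msg_id)
            return
        self.received.add(key)
        self.acks.add(msg_id)
        for ack in carried_acks:
            # Someone else acked it: our own ack is redundant.
            self.acks = drop(self.acks, ack[:2])
            unseen = ack[:2] not in self.received
            if unseen and all(m[:2] != ack[:2] for m in self.nacks):
                self.nacks.add(ack)  # acked elsewhere, never seen here
        for nack in carried_nacks:
            # A nack discards any pending ack or nack for that message.
            self.acks = drop(self.acks, nack[:2])
            self.nacks = drop(self.nacks, nack[:2])
```

For example, if A broadcasts W, B receives it and broadcasts X, and C receives only X, then C's next message carries an ack for X and a nack for W. A retransmission would resend the stored tuple unchanged rather than call `broadcast` again, since it must carry the same acknowledgments as the original.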

That  part  of  the  TRANS  protocol  described  above  is  called  transmission 
control.  Transmission  control  is  the  set  of  rules  used  by  a  host  to  decide 
which  acknowledgments  are  required  and  when  it  should  reissue  messages. 
One  of  the  main  functions  of  transmission  control  is  to  ensure  liveness:  a 
message  must  be  retransmitted  whenever  there  is  doubt  that  it  has  been 
received  by  all  hosts.  The  task  of  determining  whether  all  hosts  have  def¬ 
initely  received  a  particular  message  (so  that  the  sending  host  may  take 
the  irrevocable  step  of  discarding  the  message)  is  the  responsibility  of  a 
companion  algorithm  known  as  reception  analysis.  Although  they  appear 
to  be  separate,  the  transmission  control  and  reception  analysis  algorithms 
must  cooperate  in  order  for  the  protocol  to  be  implemented  correctly  and 
efficiently.  Transmission  control  must  cause  all  messages  to  be  reliably  de¬ 
livered  to  all  hosts.  It  must  also  provide  enough  information  in  the  message 
traffic  to  permit  reception  analysis  to  be  performed  correctly.  In  particular, 
messages  cannot  be  removed  before  they  have  been  received  everywhere  and 
they  should  be  removed  as  soon  as  they  have  been  received  everywhere.  It 
would  be  advantageous  if  the  information  in  the  message  traffic  also  allowed 
reception  analysis  to  be  performed  very  efficiently. 

The  reception  analysis  algorithm  for  Trans  is  based  on  Theorem  2 
of Part 5. The statement of that Theorem given in the earlier section is
not completely accurate (and is inconsistent with the picture presented there).
The  host  doing  the  analysis  must  follow  paths  through  the  acknowledgment 
graph  starting  from  a  set  of  nodes  representing  messages  sent  by  the  host 
being  checked.  It  cannot  just  check  paths  resulting  from  the  last  message 
received.  The  revised  wording  of  Theorem  2  reads  as  follows: 

Theorem  2 

If  there  exists  a  path  of  positive  acknowledgments  or  retrans¬ 
missions  to  message  Z  from  messages  sent  by  host  T  and  no 
negative acknowledgment has been issued for any message on
the  path  by  T  or  by  any  message  acknowledged  directly  or  in¬ 
directly  by  T  then  T  has  received  message  Z  correctly.  □ 

In  order  to  construct  a  message  reception  algorithm  based  on  Theorem  2, 
it  is  necessary  that  each  host  should  construct  an  “acknowledgment  graph” 
whose nodes are messages and whose arcs indicate acks, nacks, or
retransmissions. A later section of this specification describes how the graph is
constructed.  The  algorithm  for  analyzing  the  acknowledgment  graph  based 
on  Theorem  2  is  the  following: 

•  Assume  host  S  has  sent  message  Z  and  requires  confirmation  that  T 
received  it. 

• S must have observed message M1, broadcast by T prior to the
broadcast of Z.

• M2, ..., Mn are messages broadcast consecutively by T after M1.

•  Node  S  constructs  the  acknowledgment  graph  starting  with  M2,  and 
adding  M3  . . .  incrementally. 

•  The  leaves  of  the  graph  must  be  messages  prior  to  Z. 

•  If  any  part  of  the  graph  cannot  be  constructed  then  it  is  undetermined 
whether  the  message  Z  has  been  received  by  the  host  T  and  the 
algorithm  fails. 

• If any one version of the graph satisfies Theorem 2, the message has
been received.
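
The incremental procedure in the list above might be sketched as follows. This is a simplified reading of Theorem 2 as literally worded, not the report's implementation. The function names, the `known` dictionary (transmission -> `(acked, nacked)` sets), and the three result strings are assumptions for illustration. Leaves are represented as entries with no outgoing arcs, a `None` in T's sequence marks a gap, and retransmission arcs are not modelled.

```python
def theorem2(known, roots, z):
    """True if z is confirmed, False if not, None if the graph has a hole."""
    # Pass 1: everything T acknowledged directly or indirectly (acks only).
    reach, stack = set(), list(roots)
    while stack:
        m = stack.pop()
        if m in reach:
            continue
        if m != z and m not in known:
            return None  # acknowledgment for an unknown message
        reach.add(m)
        if m != z:
            stack.extend(known[m][0])
    # Nacks issued by T's messages or by anything they acknowledge.
    nacked = set()
    for m in reach - {z}:
        nacked |= set(known[m][1])
    if z in nacked:
        return False
    # Pass 2: look for a positive path to z that avoids nacked messages.
    stack, seen = list(roots), set()
    while stack:
        m = stack.pop()
        if m == z:
            return True
        if m in seen or m in nacked:
            continue
        seen.add(m)
        stack.extend(known[m][0])
    return False


def reception_analysis(known, t_messages, z):
    """Add T's messages M2, M3, ... one at a time and re-check after each."""
    roots = []
    for m in t_messages:
        if m is None or m not in known:
            return "undetermined"  # gap in T's sequence
        roots.append(m)
        verdict = theorem2(known, roots, z)
        if verdict is None:
            return "undetermined"
        if verdict:
            return "received"
    return "unconfirmed"
```

With `known = {"M2": ({"P"}, set()), "M3": ({"X"}, set()), "X": ({"Z"}, set()), "P": (set(), set())}`, analysing `["M2"]` alone leaves Z unconfirmed, while adding `"M3"` confirms it through the path M3, X, Z.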

The  specifications  of  the  transmission  control  and  reception  analysis 
algorithms  of  the  TRANS  protocol  given  above  were  found  to  require  con¬ 
siderable  development  and  interpretation  during  the  implementation  effort. 
The implementation was finally based on the descriptions given above and
a  set  of  assumptions  governing  their  interpretation.  During  the  implemen¬ 
tation,  several  additional  problems  with  the  specification  were  uncovered. 
Solutions  for  some  of  these  problems  were  incorporated  in  the  program, 
but  others  were  discovered  so  recently  that  there  was  insufficient  time  to 
implement them.

6.2.1 Clarifications and Interpretations

The  following  clarifications  and  interpretations  were  developed  during  our 
prototype  implementation  and  apply  to  the  TRANS  protocol  as  described 
above. 

•  There  are  two  timeout  values  in  the  transmission  control  section  of 
the protocol. If a host transmits a message and does not receive an
acknowledgment  of  any  kind  within  a  certain  time  period  then  it  will 
retransmit  the  message.  This  timeout  will  be  called  the  message  time¬ 
out.  If  a  host  creates  a  pending  acknowledgment  and  does  not  have  a 
genuine  outgoing  message  to  attach  it  to  within  a  certain  time  period, 
then  it  will  create  a  null  message  to  carry  the  acknowledgments.  This 
timeout  will  be  called  the  no-message  timeout. 

•  Unless  stated  otherwise,  when  the  term  “a  message”  is  used  in  the 
transmission control section, it means any version of a message. A
message identifier, however, is a particular [host name, sequence number,
version number] triple. The distinction is between a conceptual
message that is unchanged regardless of the version being considered
and  a  particular  transmission  of  a  message.  In  the  remainder  of  this 
report,  the  term  transmission  will  be  used  to  refer  to  a  particular 
version  of  a  message. 

•  Related  to  the  previous  item,  if  a  host  receives  one  version  of  a  mes¬ 
sage  uncorrupted  then  any  other  versions  of  that  message  that  are 
received uncorrupted are not a new message.

•  Acks  and  nacks  must  refer  to  specific  versions  (i.e.,  transmissions)  of 
messages. To see this, suppose it were not so, and suppose that a host
transmits a message X and later retransmits it in response to a nack.
It is the sender’s responsibility to keep retransmitting the message
until the host that sent the nack receives it correctly. Accordingly, it
must restart its message timeout timer and keep retransmitting until
it receives an acknowledgment of some kind. Now suppose a belated
ack arrives for the original transmission. In the absence of version
numbers, the sender might assume that this ack is acknowledging its
retransmission and it will therefore turn off its message timer and
stop further retransmissions, even though those who needed them
have not acknowledged and may not have received them.

•  Related  to  the  item  above:  a  host  will  not  retransmit  a  message  if 
the version number of the nack it receives is less than the version
number  of  the  most  recent  transmission  of  the  message.  This  “old” 
nack  can  occur  for  a  variety  of  reasons  such  as  if  the  host  missed  the 
first  transmission  of  the  message  that  carried  the  nack  or  if  the  host 
that  sent  the  nack  missed  a  retransmission  of  the  original  message. 
The  purpose  of  the  nack  was  to  cause  a  retransmission  of  the  original 
message.  This  has  already  occurred  and  the  normal  operation  of  the 
protocol  will  cause  the  retransmission  to  be  delivered  to  the  host 
that  issued  the  nack  or  another  nack  to  be  sent  causing  yet  another 
retransmission.  Responding  to  an  “old”  nack,  however,  can  cause  the 
unnecessary  replay  of  a  sequence  of  messages. 

•  A  host  will  ignore  a  transmission  of  a  message  if  it  has  previously  seen 
that  message  uncorrupted.  This  implies  that  it  will  not  examine  the 
acknowledgments  carried  by  the  retransmission  and  it  is  for  this  rea¬ 
son  that  retransmissions  must  carry  exactly  the  same  acks  and  nacks 
as  their  originals.  Another  comment  in  the  original  specification  of 
TRANS:  “It  is  permissible,  but  not  essential,  for  a  node  to  broadcast 
a  positive  acknowledgment  for  a  message  that  it  had  already  received 
uncorrupted,” does not appear meaningful because the transmission
control  rules  do  not  allow  a  host  to  create  two  acks  for  the  same 
message. 

• A host is only required to ack a message once (directly or indirectly),
not  each  version. 

•  The  rule  “when  a  node  sees  a  positive  acknowledgment  for  a  message 
that  it  has  not  received,  it  adds  a  negative  acknowledgment  to  its 
list”  is  interpreted  to  mean  that  1)  a  host  only  needs  to  send  one 
nack  for  a  message  that  it  has  not  received,  not  one  for  each  version 
of  the  message,  and  2)  if  a  host  has  a  pending  nack  for  a  message, 
then it does not need to add another if it sees an ack for a different
version of the message. These assumptions follow from the previous
assumptions about messages and transmissions.

• The statements “Node S constructs the acknowledgment graph starting
with M2, and adding M3 ... incrementally” and “If any one version
of the graph satisfies Theorem 2, the message has been received”
imply  that  the  following  procedure  is  used  for  reception  analysis.  A 
group  of  nodes  are  added  to  the  acknowledgment  graph  as  a  result  of 
the  first  message  transmitted  by  T  after  Z  was  transmitted,  in  this 
case  M2,  and  the  transmissions  that  it  acknowledges  (directly  or  in¬ 
directly).  The  reception  analysis  algorithm  is  then  applied.  If  it  fails, 
then  nodes  are  added  as  a  result  of  the  next  message  transmitted  by 
T  and  the  algorithm  is  applied  again.  This  continues  until  the  algo¬ 
rithm  succeeds,  the  host  runs  out  of  messages,  or  part  of  the  graph 
cannot  be  constructed. 

•  Messages  from  a  host  that  follow  a  specific  message  M  from  that 
host do not affect the conclusions that can be drawn by analyzing
the  messages  up  to  M  from  that  host  and  the  messages  that  they 
acknowledge.  This  is  implied  by  the  statement  “if  any  one  version  of 
the  graph  satisfies  Theorem  2,  then  the  message  has  been  received.” 

• The statement “If any part of the graph cannot be constructed then
the algorithm fails” implies that analysis must stop if an acknowledgment
is encountered for an unknown message (a message that the
host  performing  the  analysis  has  not  seen)  or  if  there  is  a  gap  in  the 
sequence  of  messages  from  the  host  being  checked. 

•  The  sequence  numbers  issued  by  a  host  follow  a  regular  pattern  and 
are not just unique identifiers. Otherwise, a host would not be able to
detect that there was a gap in the messages that it has received from
another host and reception analysis would not be usable as described.
This is based on the need to detect gaps mentioned in the previous
assumption. 

• The statement “S must have observed message M1, broadcast by T
prior to the broadcast of Z” refers to a message actually seen by S
prior to broadcasting Z and not to the message that was actually last
broadcast by T before S broadcast Z. S may not know the identity
of the message that was actually last because it may have missed it
due to an error. Similarly, M2 is not necessarily the first message
broadcast by T after S broadcast Z (it could have preceded Z). S
knowing  the  identities  of  the  messages  from  T  whose  broadcasts  pre¬ 
ceded  and  followed  the  broadcast  of  Z  would  improve  the  efficiency 
of  the  analysis  but  is  not  necessary. 
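
The version rules in the list above (acks and nacks name specific transmissions, an old nack does not trigger another retransmission, and a belated ack for an earlier transmission must not stop the timer) can be sketched from the sender's side. The class and attribute names are hypothetical. Tracking a `round_start` version, and treating a timeout retransmission as part of the current round rather than a new one, are assumptions of this sketch and not statements from the specification.

```python
class OutgoingMessage:
    """Hypothetical sender-side state for one conceptual message."""

    def __init__(self):
        self.version = 0           # version of the most recent transmission
        self.round_start = 0       # first version sent in the current round
        self.timer_running = True  # message timeout timer

    def on_timeout(self):
        """No acknowledgment arrived in time: send a new version."""
        self.version += 1
        self.timer_running = True
        return self.version

    def on_nack(self, nacked_version):
        """Return the new version to transmit, or None for an old nack."""
        if nacked_version < self.version:
            return None  # a newer transmission already answers this nack
        self.version += 1
        self.round_start = self.version  # a nack opens a new round
        self.timer_running = True        # restart the message timer
        return self.version

    def on_ack(self, acked_version):
        """Only an ack for the current round may stop the timer."""
        if acked_version >= self.round_start:
            self.timer_running = False
```

In this sketch, a nack for version 0 produces version 1 and restarts the timer; an ack for version 0 that arrives afterwards is ignored, and only an ack for version 1 or later turns the timer off.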

6.2.2  Comments 

This  section  contains  comments  about  the  TRANS  protocol.  Its  purpose  is 
to  illuminate  some  of  the  properties  of  the  protocol. 

During  reception  analysis,  a  nack  carried  by  a  message  acts  as  a  barrier. 
No  information  along  the  path  indicated  by  the  nack  could  have  been  known 
to  the  host  that  issued  the  nack  unless  it  can  be  reached  by  a  different  path. 
If  the  host  knew  about  this  other  information,  then  the  message  carrying 
the  nack  would  have  also  carried  acks  for  these  other  transmissions. 

When  a  nack  is  encountered  during  reception  analysis,  the  host  doing 
the  analysis  must  assume  that  the  host  being  checked  did  not  know  any 
information  along  the  path  indicated  by  the  nack,  even  if  this  is  not  true. 
The  host  doing  the  analysis  just  cannot  tell  from  the  available  information. 
The  difference  from  the  previous  paragraph  is  that  the  message  carrying 
the  nack  may  not  have  been  transmitted  by  the  host  being  checked.  It  may 
just  have  been  encountered  along  an  acknowledgment  path. 

Nacks  turn  out  to  be  self-healing.  A  nack  will  cause  a  new  chain  of  acks 
to  be  started,  which  will  eventually  detour  around  the  site  of  the  error.  So 
even  if  information  indicating  a  valid  reception  is  discarded  when  a  host 
issues  a  nack,  it  will  eventually  be  recovered  due  to  the  actions  initiated 
by  the  nack. 

When  a  host  removes  a  pending  ack  for  a  message  M  because  of  a  nack 
for M carried by another message, it agrees to wait and indirectly ack a
future version of M. It will not create another pending ack for a version of M
because  it  has  already  received  an  uncorrupted  version  of  M.  Removing  the 
ack  saves  transmitting  an  acknowledgment,  but  it  will  probably  increase 
reception  latency. 

With  respect  to  transmission  control,  a  nack  for  version  v  of  a  message 
indicates  to  the  host  that  originated  the  message  that  a  subset  of  the  other 
hosts have not seen any version of the message up to v and that a new
transmission is necessary. In effect, a new transmission due to a nack
starts a new round of acknowledgment for the message with a smaller set
of receivers. Hosts that have seen earlier versions of the message will ignore
the  new  transmission,  while  the  remaining  hosts  will  try  to  ack  it  (directly 
or  indirectly).  (Note  that,  according  to  the  previous  paragraph,  hosts  that 
have  seen  the  message  before  may  still  have  to  ack  it  indirectly.) 

A  host  cannot  remove  a  message  until  it  has  been  received  by  all  other 
hosts.  Initially,  it  might  appear  that  a  message  can  be  unremovable  for 
an indeterminate amount of time because some host might not be
broadcasting messages or might be sending messages without acknowledgments.
In practice, neither of these conditions can occur for an appreciable length
of time. If they do persist, then there is a serious problem with the host
or the network that should be handled by other mechanisms. Note that
determining that a message has been received by a particular host can take
an indeterminate amount of time, but it would be for different reasons such
as an unfortunate sequence of errors that cause key messages to be missed.

Every  host  must  broadcast  a  message  in  a  bounded  amount  of  time  un¬ 
less  there  is  a  serious  problem  in  the  network.  This  time  period  depends  on 
the  values  of  the  protocol  timeouts  and  the  number  of  errors  that  will  be 
tolerated  until  a  network  problem  is  considered  to  exist.  Consider  a  host 
H.  Assume  that  H  was  the  last  host  to  broadcast  a  message.  If  no  acknowl¬ 
edgment  for  the  message  is  received  within  a  message  timeout  period,  then 
H  will  retransmit  the  message.  If  a  nack  is  received  then  H  will  automat¬ 
ically  retransmit  the  message.  If  an  ack  is  received  then  H  will  create  a 
pending  ack  for  the  new  message.  If  a  client  message  is  not  received  within 
a  no-message  timeout  period,  then  H  will  create  a  null  message  to  carry 
the  pending  ack.  Now  assume  that  H  was  not  the  last  host  to  broadcast  a 
message.  If  H  saw  the  message  then  it  has  a  pending  acknowledgment  for 
it.  If  a  client  message  is  not  received  within  a  no-message  timeout  period, 
then H will create a null message to carry the pending acknowledgment.
If  H  did  not  see  the  message  then  the  original  host  will  retransmit  the 
message  after  a  message  timeout  period  because  no  host  responded  with 
an  acknowledgment  or  another  host  will  send  a  message  acknowledging  the 
first  message.  This  new  message  will  now  be  the  last  message  and  H  will 
respond  to  it  in  the  same  way.  If  H  does  not  respond  to  some  number  of 
messages  in  a  row  then  there  is  a  serious  problem  in  the  system. 

Once  a  host  has  a  pending  acknowledgment  it  cannot  simply  remove  that 
acknowledgment  because  of  information  seen  in  a  new  message.  Instead, 
it  must  replace  that  acknowledgment  with  a  different  one  depending  on 
the contents of the new message. Thus, once a host creates a pending
acknowledgment, it must send a message and that message must carry an
acknowledgment.  If  no  client  message  arrives  it  must  create  a  null  message 
when  the  no-message  timer  goes  off. 
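
The interplay between pending acknowledgments and the no-message timeout might look like the sketch below. The `AckCarrier` class and its methods are hypothetical, time is passed in explicitly to keep the example deterministic, and the replacement of one pending acknowledgment by another is not modelled. The sketch assumes the timeout is measured from the creation of the oldest pending acknowledgment.

```python
class AckCarrier:
    """Hypothetical holder of pending acks with a no-message timeout."""

    def __init__(self, no_message_timeout):
        self.timeout = no_message_timeout
        self.pending = []   # acknowledgments waiting for a carrier message
        self.oldest = None  # creation time of the oldest pending ack

    def add_ack(self, ack, now):
        if not self.pending:
            self.oldest = now  # the timeout runs from the first pending ack
        self.pending.append(ack)

    def _flush(self):
        acks, self.pending, self.oldest = self.pending, [], None
        return acks

    def send_client(self, payload):
        """A genuine client message carries whatever is pending."""
        return (payload, self._flush())

    def poll(self, now):
        """Return a null message (payload None) once the timeout expires."""
        if self.pending and now - self.oldest >= self.timeout:
            return (None, self._flush())
        return None
```

A host would call `poll` periodically; a client message that arrives before the timeout carries the acknowledgments instead, so no null message is generated.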

The  Trans  protocol  causes  a  never-ending  sequence  of  messages  to 
occur.  Even  if  no  client  messages  are  being  received  anywhere,  the  protocol 
will  continue  sending  null  messages  or  retransmissions  to  respond  to  the  last 
message  sent.  Usually,  there  will  be  a  lot  of  activity  in  the  system  because 
many  hosts  will  be  broadcasting  client  messages.  This  helps  overcome  the 
effect  of  a  few  errors. 

Each  message  sent  by  a  host  must  in  general  carry  an  acknowledgment 
for  at  least  one  other  message.  This  acknowledgment  will  then  indirectly 
acknowledge  a  series  of  other  messages.  The  main  reason  that  there  must 
be an acknowledgment is that once a host has a pending acknowledgment
it must send it or replace it with another acknowledgment, as discussed
above. 

It  is  possible  for  a  message  to  be  sent  without  any  acknowledgments. 
This  can  occur  if  a  host  sends  two  messages  before  another  host  has  sent 
any.  Since  the  first  message  carried  all  of  the  host’s  pending  acknowledg¬ 
ments,  the  second  will  not  carry  any.  In  some  sense,  the  second  message 
is  implicitly  acknowledging  the  first  message.  Another  way  this  can  occur 
is  if  a  host  issues  a  message  after  missing,  due  to  errors,  all  messages  that 
were  sent  from  the  time  it  sent  its  previous  message.  Once  again,  the  first 
message will carry all of the pending acknowledgments and the second will
not carry any. Neither of these situations can occur for long unless the
network is having serious problems. (The first message transmitted also
does not carry any acknowledgments.)

6.2.3  Problems 

Several  problems  with  the  TRANS  specification  were  uncovered  during  the 
implementation.  This  section  identifies  problems  that  were  addressed  in 
the  implementation. 

• The specification indicates that a message will be retransmitted if no
acknowledgment is received for that message before a message timeout
occurs. Presumably additional retransmissions will be made if
additional message timeout periods elapse without an acknowledgment
being received. The protocol does not specify what happens if
the timeout is satisfied by receipt of an acknowledgment, but then
a retransmission occurs because of a nack. Should a new timeout
period be started or is it unnecessary? It turns out that a new timeout
period must be started whenever a new version of a message is
transmitted.

This can be seen from the following counter-example. Let there be
three  hosts  A,  B,  and  C.  A  sends  a  message  W  that  is  seen  by  B  and 
missed  by  C.  B  sends  a  message  X  that  carries  an  ack  for  W.  A  and 
C  see  X.  A  turns  off  its  message  timer  for  W.  C  sends  a  message  Y 
that  carries  an  ack  for  X  and  a  nack  for  W.  A  retransmits  W  as  W'. 
B  ignores  W'  and  C  misses  it.  A  now  sends  Z  which  carries  an  ack 
for  Y.  C  sees  Z  and  turns  off  its  message  timer  for  Y.  At  this  point,  C 
has  not  seen  W  and  W  can  no  longer  be  retransmitted.  It  will  not  be 
automatically  sent  because  its  message  timer  is  off.  The  one  message 
that  carries  a  nack  for  W,  Y,  will  also  no  longer  be  retransmitted. 
Y’s  message  timer  is  also  off,  and  all  other  hosts  saw  Y  so  a  nack  will 
never  be  issued  for  it. 

• According to the previous item, there is a sequence of timeout periods
for a message, some started because no acknowledgments have been
received and some started because of nacks. As was mentioned in the
comment section above, each retransmission issued because of a nack
starts a new round of acknowledgments for a message. A problem
will occur if a message timer is turned off because of an ack for an
earlier round. That is, if the last round was started because of a nack
for version v of a message then the message timer cannot be turned
off for an ack for a version w of the message where w < v. An “old”
ack for some message M can occur if the host that issued M missed
previous transmissions of the message carrying the ack and is now
seeing it for the first time.

The  problem  that  occurs  is  that  the  message  may  not  have  been  seen 
by  some  hosts  but  will  not  be  retransmitted  because  of  the  message 
timer  or  a  nack.  This  can  be  seen  in  the  following  scenario.  Let  there 
be three hosts A, B, and C. A sends a message W that is seen by B
and missed by C. B sends a message X that carries an ack for W.
X is seen by C and missed by A. C sends a message Y that carries
an ack for X and a nack for W. Y is seen by A and missed by B. A
retransmits W as W'. B ignores W' and C misses it. Now the message
timer for X goes off and B retransmits it as X'. X' is seen by A and
ignored by C. A now sees the ack for W and turns off W’s message
timer.  At  this  point,  C  has  not  seen  any  version  of  W  and  W  can  no 
longer  be  retransmitted.  It  will  not  be  automatically  sent  because  its 
message  timer  is  off.  In  addition,  the  only  nack  for  W  is  for  an  earlier 
version  and  will  be  ignored. 

• The specification given earlier states: “Each message carries with it,
one or more acknowledgments to previous messages.” This is not
true, as was discussed in the comments section above. Later, in the
reception  analysis  section,  it  states  that  “The  leaves  of  the  graph  must 
be  messages  prior  to  Z.”  This  is  also  not  true  for  the  same  reasons. 

•  Theorem  2  contains  the  clause  “and  no  negative  acknowledgment  has 
been  issued  for  any  message  on  the  path  by  T  or  by  any  message  ac¬ 
knowledged  directly  or  indirectly  by  T.”  Initially,  it  was  assumed  that 
the  phrase  “or  by  any  message  acknowledged  directly  or  indirectly  by 
T”  meant  that  there  would  be  a  chain  of  positive  acknowledgments 
leading  from  a  message  issued  by  T  to  a  message  that  contained  a 
negative  acknowledgment.  It  turns  out  that  this  is  insufficient  and 
that  some  negative  acknowledgments  can  be  missed,  allowing  mes¬ 
sage  reception  to  be  falsely  detected.  The  phrase  should  read  “or  by 
any  message  acknowledged  directly  or  indirectly  by  T  with  a  chain 
of  positive  or  negative  acknowledgments  starting  with  a  positive  ac¬ 
knowledgment.” 

This problem can be seen in the example shown in Figure 6.3. Assume
the messages shown in the Figure were each sent by a different host
and let H(x) denote the host that sent message x. Given the
information in the Figure, can H(Z) deduce that H(U) saw Z? (Assume
that H(X) did not see Y and that H(Y) did not see X when X and
Y were transmitted.) According to the discussion above, H(Z) would
not be able to conclude that H(U) had seen Z. Although there is
a positive acknowledgment chain U → W → X → Z from U to Z, it is
negated by the acknowledgment chain U → W ↛ Y ↛ Z (where → and
↛ indicate acks and nacks, respectively). As shown in the following
discussion, H(Z) can plausibly deduce both that H(U) has seen and
not seen Z. It must assume the worst and assume that H(U) has not
received Z.

Figure 6.3: Reception Analysis must start from an ack

Case  1.  Assume  that  H(U)  has  seen  Z. 

Assume  that  H(U)  first  sees  Y.  It  will  replace  its  ack  for  Z  with 
an  ack  for  Y.  Assume  that  H(U)  then  sees  X.  It  will  then  create 
an ack for X. When H(U) sees W, it will replace the acks for X
and Y with an ack for W. U can then be issued with only an
ack for W.

Case 2. Assume that H(U) has not seen Z.

Assume that H(U) first sees X. It will then create an ack for X,
and a nack for Z. Assume that H(U) then sees Y. It will then
replace its nack for Z with an ack for Y. When H(U) sees W,
it will replace the acks for X and Y with an ack for W. U can
then be issued with only an ack for W.

• Theorem 2 gives contradictory information about the effect of a nack
on a path of positive acknowledgments or retransmissions. In the
example shown in Figure 6.4, the positive acknowledgment path W → X → Z
should be invalidated by the acknowledgment path W → Y ↛ Z. However,
in the two examples shown in Figure 6.5, the positive acknowledgment
paths W → Z' ⇝ Z and U → Z' ⇝ Z (where ⇝ indicates retransmission
and, as before, → and ↛ indicate acks and nacks) should not be
invalidated by the acknowledgment paths W → Y ↛ Z. In the second
(rightmost) example of Figure 6.5, W and X are issued by the same host,
with W preceding X. This problem could
be solved by carefully rewording the theorem, but it is not clear why
the path must return to the initial version of Z. Another approach is
to drop retransmission arcs from the graph and reword the theorem
to read “If there exists a path of positive acknowledgments to message
Z or one of its retransmissions from messages sent by node T . . . has
received message Z correctly.” A retransmission of a message carries
exactly the same acknowledgments as the original message so it can
be used directly to continue analysis of the graph.

Figure 6.4: Positive Acknowledgment Path Invalidated by nack

Figure 6.5: Positive Acknowledgment Paths not Invalidated by nack
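
The difference between the two readings of the phrase “acknowledged directly or indirectly by T” can be checked on the Figure 6.3 example. The sketch below is illustrative only: the function name and the graph encoding (message -> `(acked, nacked)` sets) are assumptions, and the arcs are taken from the chains quoted in the text (U acks W, W acks X and nacks Y, X acks Z, Y nacks Z), since the figure itself is not reproduced here.

```python
def visible_nacks(graph, start, follow_nacks):
    """Messages nacked by `start` or by any message reached from it.

    follow_nacks=False: reach messages through positive acks only.
    follow_nacks=True:  after an initial positive ack, follow acks and nacks.
    """
    acked, nacked_here = graph[start]
    nacked = set(nacked_here)
    stack = list(acked)  # the chain must begin with a positive ack
    seen = {start}
    while stack:
        m = stack.pop()
        if m in seen or m not in graph:
            continue
        seen.add(m)
        m_acked, m_nacked = graph[m]
        nacked |= m_nacked
        stack.extend(m_acked)
        if follow_nacks:
            stack.extend(m_nacked)
    return nacked


# Arcs as described for Figure 6.3 (a reconstruction from the text).
FIG_6_3 = {
    "U": ({"W"}, set()),
    "W": ({"X"}, {"Y"}),
    "X": ({"Z"}, set()),
    "Y": (set(), {"Z"}),
    "Z": (set(), set()),
}
```

Under the narrow reading, `visible_nacks(FIG_6_3, "U", False)` finds only the nack for Y, so the positive chain to Z would be accepted. Under the revised wording, `visible_nacks(FIG_6_3, "U", True)` also finds the nack for Z carried by Y, and reception of Z by H(U) is not confirmed.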


6.3  Implementation 

The protocol was implemented on a network of Sun workstations connected
by an Ethernet. Two programs were written, one for the protocol and one
for a driver used to exercise the protocol. The programs were written in C
and run under the UNIX operating system. Each workstation contains a
protocol  and  a  driver  process.  The  driver  periodically  sends  a  message  to 
its  protocol  process,  which  then  broadcasts  the  message  to  other  protocol 
processes.  The  protocol  program  consists  of  two  main  sections:  one  for 
transmission  control  and  the  other  for  reception  analysis.  Initial  perfor¬ 
mance  measurements  were  made  with  two  host  sets  and  four  error  levels. 

The program is written modularly so that alternate protocol rules can
be  examined.  In  particular,  the  reception  analysis  section  is  completely 
separate  from  the  rest  of  the  program  and  can  be  replaced  by  more  efficient 
versions in the future. There are a variety of program options that can be
set  when  the  protocol  programs  are  started.  This  permits  a  wide  variety  of 
configurations  to  be  set  up  for  experimentation. 

Reception analysis could not be implemented directly from the
specification given in the original description of Trans. As explained earlier,
that description mixes existence-proof arguments with implementation
directions, and does not describe an efficient algorithm for building,
maintaining, and examining the acknowledgment graph. In addition, certain
implementation details were missing, such as an indication of when graph
information can be discarded. It was decided to maintain an up-to-date
graph and to add information as transmissions arrived. When information
was added to the graph, the appropriate analysis would be performed to
determine whether new receptions could be confirmed. Information would
be removed from the graph as soon as it was no longer needed. Specific
version  numbers  must  be  indicated  for  information  in  the  acknowledgment 
graph. 
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6.3.1  Top-Level  Design 

A  prototype  version  of  the  TRANS  protocol  was  implemented  to  run  on  a 
network of Sun workstations connected by an Ethernet. The workstations
were running the Sun 3.4 version of the UNIX operating system. Although
TRANS is a link-level protocol, it was implemented at the presentation
layer.  This  greatly  simplified  the  implementation  while  still  allowing  a 
realistic  and  thorough  evaluation. 

The protocol was written in the C programming language and was run
as a single process. A second program called the driver was also written in
C and was also run as a single process. Each workstation contained a single
protocol process and a single driver process. A driver represented the set
of clients on a workstation using TRANS, and sent broadcast messages to
the protocol process on its workstation. The drivers sent messages with
Poisson arrivals, that is, with exponentially distributed inter-arrival
times. When a protocol process received a message
from its driver, it broadcast the message to the other protocol processes.
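
A driver with Poisson arrivals can be sketched by drawing exponentially distributed gaps between messages. The following is a minimal illustration in Python, not the original C driver. The function name `arrival_times` and the fixed seed are assumptions made for the example; the 12-second mean gap is the value reported in Section 6.4.

```python
import random


def arrival_times(mean_gap, count, rng):
    """Return `count` send times whose gaps are exponentially distributed.

    Exponential gaps with mean `mean_gap` produce a Poisson arrival process.
    """
    times = []
    now = 0.0
    for _ in range(count):
        now += rng.expovariate(1.0 / mean_gap)  # rate is 1 / mean gap
        times.append(now)
    return times


rng = random.Random(1)  # fixed seed so that a run is repeatable
times = arrival_times(12.0, 1000, rng)
average_gap = times[-1] / len(times)  # close to 12 seconds for a long run
```

A real driver would sleep until each send time and then hand the message to its protocol process.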

Communication between the processes was accomplished with the User
Datagram Protocol (UDP). Each protocol process was connected to its
driver through a port bound to the host's address and the set of
protocol processes were connected through a port bound to the broadcast
address. Messages were broadcast among the protocol processes with the
UDP broadcast mechanism. No errors were simulated for communication
between a driver and its protocol process. An error rate could be
individually set for each protocol process, however, to control message reception
from other protocol processes.
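
The arrangement above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the original C code: `make_broadcast_socket` and `LossyReceiver` are hypothetical names, and modelling a reception error as discarding the whole datagram follows the decision in Section 6.3.2 that a message is either fully received or not received at all.

```python
import random
import socket


def make_broadcast_socket(port):
    """Create a UDP socket that may send to and receive from the broadcast address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.bind(("", port))
    return sock


class LossyReceiver:
    """Discards a configurable fraction of incoming datagrams."""

    def __init__(self, error_rate, rng):
        self.error_rate = error_rate  # probability that a datagram is lost
        self.rng = rng

    def accept(self, datagram):
        """Return the datagram, or None if it is treated as lost."""
        if self.rng.random() < self.error_rate:
            return None
        return datagram
```

Each protocol process would pass every datagram read from the broadcast socket through its own `LossyReceiver`, so that the error rate can differ from host to host.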

6.3.2  Implementation  Decisions 

The  following  decisions  were  made  when  the  protocol  was  implemented: 

•  It would not be possible for a host to receive a message with a
corrupted body and an uncorrupted header. A message would either be
fully received or not received at all.

•  The  length  of  the  no-message  timeout  would  be  the  same  for  pending 
acks  and  nacks. 

•  A list of old nacks would not be maintained. An old nack refers to
a transmission for which the current host previously issued a nack.
As a result, a host can issue multiple (redundant) nacks for the same
transmission. (A host must issue a new nack for a new version of a
message that it has not seen to indicate that it missed that
retransmission too.)

6.3.3  Data  Structures 

Transmission  Control 

The  data  structures  for  transmission  control  are  straightforward.  A  list 
for  pending  acks  and  a  list  for  pending  nacks  must  be  maintained  until  a 
message  is  sent  and  these  lists  can  be  cleared.  A  list  of  pending  messages 
must  also  be  maintained.  These  are  messages  that  have  been  sent  by  the 
host  but  have  not  yet  been  verified  as  received  by  all  other  hosts.  One 
additional list is maintained of old acks. This list represents messages that
have  been  seen  by  the  host  in  an  uncorrupted  form.  Any  message  the  host 
has  seen,  whether  or  not  the  host  actually  issued  an  ack  for  it,  is  represented 
in  this  list  (except  those  still  in  the  pending  ack  list). 

The old-ack list is actually represented in a condensed form. There is a
separate number/linked-list pair for each host. The number is the largest
message sequence number from a host that the current host has seen such
that all preceding messages from that host have been seen. The linked
list, which is maintained in ascending order, keeps track of more recent
messages. The first node in the linked list represents a missing message
and the last node in the linked list represents the most recent message seen
from that host. As gaps are filled in, the linked list is condensed and the
latest consecutive sequence number is raised. A more sophisticated data
structure will be needed if sequence numbers can be reused.
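
The condensed form can be illustrated with a short sketch. It is a simplification under stated assumptions: the class name `SeenList` is hypothetical, sequence numbers are assumed to start at 1 and never to be reused, and the sorted list stores only the numbers that have been seen beyond the first gap, rather than also holding entries for missing messages as the linked list described above does.

```python
import bisect


class SeenList:
    """Condensed record of the sequence numbers seen from one host."""

    def __init__(self):
        self.contiguous = 0  # every sequence number 1..contiguous has been seen
        self.later = []      # seen numbers beyond the first gap, ascending

    def has_seen(self, seq):
        if seq <= self.contiguous:
            return True
        i = bisect.bisect_left(self.later, seq)
        return i < len(self.later) and self.later[i] == seq

    def mark_seen(self, seq):
        if self.has_seen(seq):
            return
        bisect.insort(self.later, seq)
        # Condense: absorb entries that now extend the contiguous prefix.
        while self.later and self.later[0] == self.contiguous + 1:
            self.contiguous = self.later.pop(0)


seen = SeenList()
for seq in (1, 2, 5, 4):
    seen.mark_seen(seq)
# Message 3 is still missing, so the prefix stops at 2.
state_with_gap = (seen.contiguous, list(seen.later))
seen.mark_seen(3)
# Filling the gap condenses the list and raises the number to 5.
state_after_fill = (seen.contiguous, list(seen.later))
```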

Acknowledgment  Graph 

Several data structures are used to represent the acknowledgment graph.
The main data structure is the node list. It represents a graph of all
known messages that may be needed to help detect reception of a
host’s messages. Another structure, called the host list, keeps track of the
messages that have been seen from each host. It is not required, but it
simplifies examination of the node list.

The node list maintains an up-to-date representation of the pertinent
message traffic. Each node represents a single message. When a new
message is sent or a new, relevant message arrives, a new node is added to the
list. A special node, called an unknown node, must also be added when a
message that has not been seen before is detected as an acknowledgment
in another message. Unknown nodes are filled in when their corresponding
messages are seen. Nodes are removed from the node list when they are
no longer needed in the message graph. Nodes that represent a host’s own
messages must also be kept until that message has been verified as received
by all other hosts.

Each  node  has  two  linked  lists,  one  for  the  acks  carried  by  the  message 
and  one  for  the  nacks.  Acknowledgments  are  only  added  to  one  of  these 
lists  if  they  are  needed  in  the  graph.  An  acknowledgment  is  not  needed  if  it 
refers  to  a  message  that  has  already  been  seen  but  is  no  longer  represented 
by  a  node  in  the  graph.  These  linked  lists  contain  full  message  identifiers, 
not  pointers  to  nodes. 
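
A node of this kind might be represented as shown below. This is a sketch only: the names `Node` and `NodeList` and the layout of a message identifier as a (host, sequence number, version) tuple are assumptions made for illustration. Because the ack and nack lists hold identifiers rather than pointers, a reference to a message whose node has been removed simply fails to resolve.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

MessageId = Tuple[str, int, int]  # (host, sequence number, version): assumed layout


@dataclass
class Node:
    ident: MessageId
    unknown: bool = False  # True until the message itself has been seen
    acks: List[MessageId] = field(default_factory=list)
    nacks: List[MessageId] = field(default_factory=list)


class NodeList:
    """Nodes indexed by full message identifier."""

    def __init__(self):
        self.nodes: Dict[MessageId, Node] = {}

    def add(self, node: Node) -> None:
        self.nodes[node.ident] = node

    def lookup(self, ident: MessageId) -> Optional[Node]:
        return self.nodes.get(ident)

    def remove(self, ident: MessageId) -> None:
        self.nodes.pop(ident, None)


graph = NodeList()
z = Node(("A", 1, 1))
y = Node(("B", 1, 1), acks=[z.ident])  # Y carries an ack for Z
graph.add(z)
graph.add(y)
```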

The  host  list  is  similar  to  the  old-ack  list  used  for  transmission  control. 
There is a separate number/linked-list pair for each host. The number is the
largest  message  sequence  number  from  a  host  that  the  current  host  has  seen 
such  that  all  preceding  messages  from  that  host  have  been  seen  and  removed 
from  the  node  list.  The  linked  list,  which  is  maintained  in  ascending  order, 
keeps  track  of  more  recent  messages.  A  more  recent  message  can  be  in  one 
of  three  states: 

gap:  not  received  or  referred  to  by  an  acknowledgment. 

gone:  received  and  no  longer  represented  in  the  node  list. 

node:  received  or  referred  to  by  an  acknowledgment  and  still  represented 
in  the  node  list  (including  “unknown”  nodes). 

The first message in the linked list is either “gap” or “node”, and the last
message in the list is either “gone” or “node”. As messages at the beginning
of the list become “gone”, the list is condensed and the latest consecutive
sequence number is raised. A more sophisticated data structure will be
needed if sequence numbers can be reused. Once again, the linked lists
contain full message identifiers, not pointers to nodes.

Reception Analysis

Message reception analysis is performed with the data structures that form
the acknowledgment graph. A matrix is used to keep track of which
messages have been received by which hosts. The columns of the matrix are
the hosts in the system and the rows are the pending messages.
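
The matrix can be pictured with the following sketch, in which each row is stored as the set of hosts known to have received one pending message. The class name `ReceptionMatrix` and its methods are hypothetical, and the sparse row representation is a choice made for the example rather than the layout used in the original program.

```python
class ReceptionMatrix:
    """Rows are pending messages; columns are the other hosts."""

    def __init__(self, hosts, me):
        self.others = [h for h in hosts if h != me]
        self.rows = {}  # pending message id -> hosts known to have received it

    def add_pending(self, message):
        self.rows[message] = set()

    def mark_received(self, message, host):
        """Record a reception; return True once every other host has the message."""
        if message in self.rows and host in self.others:
            self.rows[message].add(host)
        return self.received_by_all(message)

    def received_by_all(self, message):
        return message in self.rows and len(self.rows[message]) == len(self.others)


matrix = ReceptionMatrix(["A", "B", "C", "D"], me="A")
matrix.add_pending(("A", 1))
first = matrix.mark_received(("A", 1), "B")
second = matrix.mark_received(("A", 1), "C")
third = matrix.mark_received(("A", 1), "D")
```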

6.3.4  Algorithms 

Transmission  Control 

The  algorithms  for  transmission  control  follow  directly  from  the  rules.  They 
control  what  happens  when  a  message  arrives  from  a  client  or  another  host 
or  a  timer  goes  off.  The  important  transmission  control  algorithms  are: 

New Client Message Arrives
    increment sequence number
    add pending acknowledgments to message
    broadcast message
    start message timer
    turn off no-message timer (if necessary)
    make a node for the message
    add the node to the graph

New Broadcast Message Arrives
    create a pending ack for the message
    start no-message timer (if necessary)
    process message ack list
    process message nack list
    make a node for the message
    if node is needed
        add node to graph

Process Message Ack List
    if ack for my message
        if for latest round for message
            turn off message timer
    else if acked message is unknown
        create nack for unknown message
        start no-message timer (if necessary)
    else if pending ack exists for message
        remove pending ack
        cancel no-message timer (if necessary)

Process Message Nack List
    if nack for my message
        if for most recent version of message
            start new round for message
            rebroadcast message
            restart message timer
    else if pending nack exists for message
        remove pending nack
        cancel no-message timer (if necessary)
    else if pending ack exists for message
        remove pending ack
        cancel no-message timer (if necessary)

Message Timer Goes Off
    rebroadcast message
    restart message timer

No-message Timer Goes Off
    increment sequence number
    create null message
    add pending acknowledgments to message
    broadcast message
    start message timer
    make a node for the message
    add the node to the graph
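
The pending-acknowledgment bookkeeping in these algorithms can be sketched in Python. The sketch is deliberately partial and rests on several assumptions: timers, rounds, versions, and the acknowledgment graph are omitted, messages are plain dictionaries, message identifiers are (host, sequence number) pairs, and the class name `TransmissionControl` is hypothetical.

```python
class TransmissionControl:
    """Pending ack and nack bookkeeping for one host (timers omitted)."""

    def __init__(self, me):
        self.me = me
        self.seq = 0
        self.seen = set()           # identifiers of messages this host has seen
        self.pending_acks = set()
        self.pending_nacks = set()

    def on_client_message(self, body):
        """New Client Message Arrives: build the message to broadcast."""
        self.seq += 1
        message = {
            "id": (self.me, self.seq),
            "body": body,
            "acks": sorted(self.pending_acks),
            "nacks": sorted(self.pending_nacks),
        }
        # The pending lists are cleared once their contents have been sent.
        self.pending_acks.clear()
        self.pending_nacks.clear()
        self.seen.add(message["id"])
        return message

    def on_broadcast_message(self, message):
        """New Broadcast Message Arrives: update the pending lists."""
        self.seen.add(message["id"])
        self.pending_acks.add(message["id"])
        for acked in message["acks"]:
            if acked[0] == self.me:
                continue  # would turn off the message timer (omitted)
            if acked not in self.seen:
                self.pending_nacks.add(acked)      # acked message is unknown
            elif acked in self.pending_acks:
                self.pending_acks.discard(acked)   # already acked on our behalf
        for nacked in message["nacks"]:
            if nacked[0] == self.me:
                continue  # would start a new round and rebroadcast (omitted)
            if nacked in self.pending_nacks:
                self.pending_nacks.discard(nacked)
            elif nacked in self.pending_acks:
                self.pending_acks.discard(nacked)
```

For example, a host that sees message (A, 1) and then a message from C that acks (A, 1) is left with a single pending ack, for C's message, which is the chaining effect that keeps the number of explicit acknowledgments small.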

Acknowledgment  Graph 

The acknowledgment graph algorithms are used to add nodes to and remove
nodes from the graph. When a node is made for a message,
acknowledgments are added only if they refer to other graph nodes. An important
task of the algorithms is creating and filling in unknown nodes. When an
unknown node is filled in, all hosts are checked to see if any new
message receptions can now be detected, since the unknown node(s) may have been
blocking many paths. When a new message is added, only the host that
sent that message must be checked. When it is determined that a message
has been received by all other hosts, an attempt is made to remove the node
for that message from the graph. Removal of a node can cause a sequence
of node removals to occur. The important algorithms are:

Make a Node for a Message
    for each (n)ack in the message (n)ack list
        if (n)acked message is unknown
            create unknown node for (n)acked message
            add unknown node to graph
            add reference for unknown node in host list
            add (n)ack to node (n)ack list
        else if (n)acked message has a node in the graph
            add (n)ack to node (n)ack list

Add a Node to the Graph
    if node was unknown node in graph
        fill in all copies of unknown node
        if new node is for a new version of message
            add node to node list
        check all hosts for new reception
    else
        add node to node list
        add reference for node in host list
        check host for new reception

Remove Nodes from the Graph
    repeat
        look for a node that
            1. has empty ack and nack lists, and
            2. if it is my message, has been received by all hosts
        if found
            remove node from node list
            remove all references to node from other nodes’ ack and nack lists
            mark as “gone” in host list
    until no such node is found
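
The cascading removal can be illustrated with a small Python sketch. It assumes a simplified graph stored as a dictionary from message identifier to a pair of sets; the function name `remove_nodes` is hypothetical, and the returned list stands in for marking messages as “gone” in the host list.

```python
def remove_nodes(graph, mine, received_by_all):
    """Repeatedly remove nodes that are no longer needed.

    graph: dict mapping identifier -> {"acks": set, "nacks": set}
    mine: identifiers of this host's own messages
    received_by_all: my messages verified as received by all other hosts
    Returns the removed identifiers in removal order.
    """
    removed = []
    while True:
        candidate = None
        for ident, node in graph.items():
            if node["acks"] or node["nacks"]:
                continue  # still refers to other nodes in the graph
            if ident in mine and ident not in received_by_all:
                continue  # my own message, not yet verified everywhere
            candidate = ident
            break
        if candidate is None:
            return removed
        del graph[candidate]
        # Dropping references may empty other nodes' lists and free them too.
        for node in graph.values():
            node["acks"].discard(candidate)
            node["nacks"].discard(candidate)
        removed.append(candidate)


graph = {
    "Z": {"acks": set(), "nacks": set()},   # my message
    "Y": {"acks": {"Z"}, "nacks": set()},   # Y acks Z
    "W": {"acks": {"Y"}, "nacks": set()},   # W acks Y
}
order = remove_nodes(graph, mine={"Z"}, received_by_all={"Z"})
```

In this example, removing Z empties Y's ack list, which in turn frees W, so one verified reception clears the whole chain.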

Reception Analysis

Detecting whether another host has seen any of a specific host’s messages is
accomplished by examining the acknowledgment graph in a specific order.
The search starts with the first message from the other host that has not
been marked as “gone.” It then continues sequentially through the
remaining messages known to have been sent by that host. If a gap is found in the
host’s message sequence or an unknown node is found in the graph, then
the search is aborted. This corresponds to detecting that a portion of the
graph cannot be constructed in the reception analysis specification. Any
receptions that have been detected up until the search is aborted are valid.

The graph is searched in two phases. First, all paths leading from the
other host’s current message are examined for unknown nodes and bad
nodes. Bad nodes are nodes that cannot be used in a positive
acknowledgment path because they are nacked by some acknowledgment path. The
list of bad nodes continues to grow as the analysis proceeds through the
other host’s messages. If an unknown node is found the search is aborted.
The second phase examines all positive acknowledgment paths leading from
the other host’s current message. No unknown nodes should be found now
because they would have been detected in the first phase. A positive
acknowledgment path is abandoned if it leads to a bad node. When one of
the messages from the host doing the analysis is found, it is marked as
received.

The  important  algorithms  for  reception  analysis  are: 

Check All Hosts
    for all hosts (except mine)
        check host

Check Host
    start with first non-“gone” message for host being checked
    while not “gap” message
        if node is unknown
            break
        if any nacks in node’s nack list are unknown
            break
        add nacks in node’s nack list to bad list
        find bad nodes starting from node’s ack list
        if unknown node found while searching for bad nodes
            break
        search for received messages from node’s ack list
        move to next message for host being checked
    if any of my messages were received by all hosts
        remove nodes

Find Bad Nodes
    for each acknowledgment on list being checked
        if unknown node
            break
        if one of my messages
            ignore (this earlier message has already been checked)
        else
            find bad nodes starting from node’s ack list
            if unknown node found while searching for bad nodes
                break
            find bad nodes starting from node’s nack list
            if unknown node found while searching for bad nodes
                break

Search for Received Messages
    for each ack on node’s ack list
        if in bad list
            ignore ack
        else if one of my messages
            mark as received
            if received by all other hosts
                set received-by-all flag
        else
            search for received messages from node’s ack list
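
The two-phase search can be sketched in Python as follows. This is one interpretation of the pseudocode, not the original program: versions are ignored, the graph is a dictionary keyed by message identifier, every message nacked along any path from the current message is treated as bad, and the names `make_node`, `find_bad`, `find_received`, and `check_host` are hypothetical.

```python
class UnknownNode(Exception):
    """Raised when the search reaches a message that has not been seen."""


def make_node(acks=(), nacks=(), unknown=False):
    return {"acks": list(acks), "nacks": list(nacks), "unknown": unknown}


def find_bad(graph, idents, mine, bad, visited):
    """Phase 1: collect every message nacked along any path from idents."""
    for ident in idents:
        node = graph.get(ident)
        if node is None:
            continue  # no longer represented in the graph
        if node["unknown"]:
            raise UnknownNode(ident)
        if ident in mine or ident in visited:
            continue
        visited.add(ident)
        bad.update(node["nacks"])
        find_bad(graph, node["acks"], mine, bad, visited)
        find_bad(graph, node["nacks"], mine, bad, visited)


def find_received(graph, idents, mine, bad, received, visited):
    """Phase 2: follow positive acknowledgment paths that avoid bad nodes."""
    for ident in idents:
        if ident in bad or ident in visited:
            continue
        visited.add(ident)
        if ident in mine:
            received.add(ident)
        elif ident in graph:
            find_received(graph, graph[ident]["acks"], mine, bad, received, visited)


def check_host(graph, host_messages, mine):
    """Return my messages that the host is known to have received.

    host_messages lists the host's messages in sequence order, from the
    first one not marked "gone" up to the first gap.
    """
    bad, received = set(), set()
    for ident in host_messages:
        node = graph.get(ident)
        if node is None or node["unknown"]:
            break
        try:
            bad.update(node["nacks"])
            find_bad(graph, node["acks"] + node["nacks"], mine, bad, set())
        except UnknownNode:
            break  # receptions detected so far remain valid
        find_received(graph, node["acks"], mine, bad, received, set())
    return received


Z, Y, W, X, U = ("A", 1), ("B", 1), ("D", 1), ("C", 1), ("B", 2)
graph = {
    Z: make_node(),                      # my message
    Y: make_node(acks=[Z]),              # B acks Z
    W: make_node(acks=[Y], nacks=[Z]),   # D acks Y but nacks Z
    U: make_node(unknown=True),          # acked by X, never seen
    X: make_node(acks=[U]),
}
seen_by_b = check_host(graph, [Y], mine={Z})
seen_by_d = check_host(graph, [W], mine={Z})
seen_by_c = check_host(graph, [X], mine={Z})
```

Here B is found to have received Z, D is not (its only path to Z passes through a message that D itself nacked), and the search for C is aborted at the unknown node.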


6.4  Performance  Measurements 

Several preliminary performance measurements were taken to obtain a
general understanding of the behavior of the TRANS protocol. Our main
interests were determining how much storage would be required for the
acknowledgment graph and for reception analysis, how long it would take for
a host to detect that one of its messages had been received by all other
hosts, how long the pending message queue would grow, and how many
extra messages would be required for reliable delivery. We also wanted to
see how these values would change as the error rate or the number of hosts
in the network grew.

The system can be configured by setting the number of hosts in the
network, and for each host 1) the message reception error rate, 2) the
message timeout value, 3) the no-message timeout value, and 4) the average
client message inter-arrival time. The number of hosts was set at four and
eight. The error rate was the same for all hosts during a trial. With four
hosts, it was set at 0%, 1%, 5%, and 10%. With eight hosts, it was set at
0% and 1%. Relatively low error rates were used because the protocol is
really intended for networks with very reliable basic communication media
such as an Ethernet. The no-message timeout value was set at 1.5 times the
average inter-arrival time. This ensured that there would be times when the
no-message timer would go off. The message timeout value was set at twice
the no-message timeout value. The message timeout value should probably
be set to be greater than the no-message timeout value so that hosts have
a chance to return acks before the original message is rebroadcast. The
timeout values and the inter-arrival time were set the same for all hosts, at
18, 36, and 12 seconds respectively.
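
The relationship between the three timing parameters can be checked with a few lines of Python. The function name `timeouts` is hypothetical; the factors 1.5 and 2 and the 12-second inter-arrival time are the values stated above.

```python
def timeouts(mean_inter_arrival):
    """Derive both timeout values from the mean client inter-arrival time."""
    no_message = 1.5 * mean_inter_arrival  # 1.5 times the inter-arrival time
    message = 2 * no_message               # twice the no-message timeout
    return no_message, message


no_message_timeout, message_timeout = timeouts(12)
```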

The results from the trials are shown in Tables 6.1 and 6.2.

The performance figures obtained were reassuring and show that TRANS
does perform well. There were very few surprises, with most values rising as
the error rate grew or the number of hosts grew. The extra messages that
were sent were divided into those attributed to nacks, message timeouts,
and no-message timeouts. Pending messages were messages that had been
sent by a host but not yet detected as received by all other hosts. Latency
indicates the time between when a message is sent and when it is detected as
received by all other hosts. “Latency to remove” is the time between when
a message is sent and when its related information is actually discarded.
This would be expected to be longer than the simple latency time because
graph information sometimes needs to be kept in order to resolve paths to
other nodes.

The percentage of extra messages grew as the error rate increased and
the number of hosts increased. It was always very low, however, and
compares very favorably with point-to-point protocols that would require O(N)
messages for every message sent. Here, the number of extra messages was
O(1). Even more favorable results could have been obtained by increasing
the no-message timeout with respect to the inter-arrival time so that client
messages could carry more pending acknowledgments. This would possibly
cause more message timeouts, but they seem rare anyway.

[Table 6.1 did not survive extraction intact. Its rows are: # of client
messages; # of messages sent; # of extra messages sent; % extra messages;
# extra messages due to nack, message timeout, and no-message timeout;
max. and ave. pending messages; max. and ave. nodes in graph; and, under
latency times in seconds, max. latency, ave. latency, max. latency to
remove, and ave. latency to remove, with one column per error rate. The
only surviving values, whose cells cannot be identified, are 5; 102, 172,
18.00, 26.54; 102, 172, 19.61, 30.62.]

Table 6.1: Performance Measurements with 4 Hosts

[The upper rows of Table 6.2 did not survive extraction.]

                                  error rate
                                  0%        1%
max. nodes in graph               23        64
ave. nodes in graph               6.72      8.91

latency times in seconds:
  max. latency                    19        67
  ave. latency                    14.94     16.43
  max. latency to remove          19        67
  ave. latency to remove          14.96     17.12

Table 6.2: Performance Measurements with 8 Hosts

The number of message timeouts increased as the error rate increased.
This implies that a few messages carrying key acknowledgments were dropped,
thereby forcing hosts to retransmit some messages.

The number of pending messages grew with the error rate but the
average number was very small. Similarly, the average number of nodes required
in the acknowledgment graph was pleasingly small, even with an error rate
of 10%. It was encouraging to see that the amount of storage required was
not prohibitive.

The latency times were also very respectable and were close to the client
message inter-arrival time. The other latency time of interest is the time
between when a message is sent and when it is actually received by all other
hosts. This value was not measured but it is expected to be very small. It
is interesting to note that the removal latency time is very close to the simple
latency time. This indicates that when a message is detected as received by
all hosts its related information is basically ready for removal. This helps
prevent large storage requirements.


6.5  Problems  Discovered 

Several problems and suspected problems were discovered in our
implementation of the TRANS protocol. Further analysis may determine that
the suspected problems are actually handled correctly, but there was
insufficient time to find solutions to the known problems or examine the
suspected problems. It is believed that the problems are not too deep and
can be solved.

TRANS seems to work well in ideal situations where there are few errors,
all hosts see all messages in order, and actions occur instantaneously. The
problems with the protocol that were described earlier and in this section
arise from several sources:

•  Because of errors, it is possible for a host to see messages out of order.
This means that acknowledgments for later versions of messages can
be seen before acknowledgments for earlier versions, later versions
of messages can be seen before acknowledgments for earlier versions,
later messages from a host can be seen before any version of an earlier
message from that host, and so on. The effect is that actions can be
required before a host knows the full message history.

•  Version numbers are important in certain situations. It is not always
sufficient to consider which message is involved before taking an
action; sometimes the specific transmission must be considered. For
example, an ack for an earlier version of a message cannot turn off a
message timer for a later round.

•  Reception analysis seems to require more information to work
properly than transmission control. Both use information acquired from
acknowledgments in messages to perform their duties, but it seems
that the acknowledgments that are sufficient to perform transmission
control are insufficient to perform reception analysis. More study
is required to determine whether transmission control needs to be
redesigned to include more acknowledgments or reception analysis
needs to be redesigned to make better use of the information that is
currently available.

•  Multiple acknowledgments can be received for the same message. This
can occur because errors keep one host from seeing another host’s
message or because the network is not ideal and two hosts can act
simultaneously.

The following problems have been identified:

•  The way that message timers are handled is still not correct. The
current approach will not work if the following occurs. Assume that
the latest round for a message from host H starts with version 1 and
that H has also transmitted versions 2 and 3. Because of errors, H
has not seen message M, which carries an ack for version 1, or message
N, which carries a nack for version 2. M now arrives and H turns off
the message’s timer. Now N arrives and is ignored because it nacks
an old version of the message. The timer for H’s message will not be
turned on again, although some host has still not seen that message.
The rules for determining when a message timer can be turned off, when
an ack or nack for an old version of a message can be ignored, when a
message can be removed, and when a message should be rebroadcast must
be reexamined.

•  A host must form a pending nack for every version of a message that
it sees acked in another message. This contradicts the rule that says
that only one pending nack must be maintained for a message. In
fact, a pending nack must be maintained for each transmission of a
message. This can be seen in the following situation (see Figure 6.6).
Assume that there are four hosts A, B, C, and D. A broadcasts
message Z which is seen by B and missed by C and D. Before B can
respond, Z’s message timer goes off and A retransmits Z as Z'. Z' is
ignored by B, seen by C and missed by D. Now B transmits message
Y which carries an ack for Z. It is seen by A and D but missed by C.
D forms a pending nack for Z. C now transmits message X which
carries an ack for Z'. X is seen by everyone. D does not form a pending
nack for Z' because it already has one for Z. Now D issues message
W which carries a nack for Z. The current situation is shown in
Figure 6.6. If A now performs reception analysis, it would incorrectly
determine that D has received Z'. This is an example of a situation
where reception analysis requires more information than transmission
control.

        Z           Z'
        |\          |
    ack | \ nack    | ack
        |  \        |
        Y   |       X
         \  |      /
      ack \ |     / ack
           \|    /
             W

Figure 6.6: Pending nack must be Maintained for each Retransmission


•  A similar situation occurs when a host H has a pending nack for
message M and sees a message N which carries a nack for M. According
to the transmission control rules, H should replace its pending nack
with an ack for N. Actually, it should just add a pending ack for N
if its nack is for a different version of M than the nack carried by
N. Otherwise, reception analysis would incorrectly find a reception
path. This situation is shown in Figure 6.7. Assuming that the host
that issued W did not see Z, message W should carry a nack for Z
and shouldn’t have removed it because of the nack for Z' carried by
message X.

Figure 6.7: Retransmissions Require Individual nacks

•  Similar questions arise about other rules for replacing
acknowledgments. For example, it appears likely that older acknowledgments for
a message should not replace newer acknowledgments for the message.
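
One way to picture the per-transmission bookkeeping that the second problem calls for is sketched below. It is a hypothetical illustration, not a verified fix: the class name `PendingNacks` is invented, and the rule that a host stops nacking a message once it has seen any version of it is an assumption that the report does not settle.

```python
class PendingNacks:
    """Pending nacks keyed by transmission (message, version), not by message."""

    def __init__(self):
        self.seen_messages = set()  # messages seen in at least one version
        self.pending = set()        # (message, version) pairs still to be nacked

    def note_seen(self, message, version):
        """This host received some version of the message."""
        self.seen_messages.add(message)
        self.pending = {p for p in self.pending if p[0] != message}

    def note_acked_by_other(self, message, version):
        """Another host's message carries an ack for this transmission."""
        if message not in self.seen_messages:
            self.pending.add((message, version))

    def drain(self):
        """Return the nacks to attach to the next outgoing message."""
        nacks = sorted(self.pending)
        self.pending.clear()
        return nacks


# Host D in the Figure 6.6 scenario: it missed both Z (version 1) and Z' (version 2).
d = PendingNacks()
d.note_acked_by_other("Z", 1)  # Y carries an ack for Z
d.note_acked_by_other("Z", 2)  # X carries an ack for Z'
nacks_in_w = d.drain()
```

With one entry per transmission, message W would carry nacks for both Z and Z', so host A could no longer conclude that D had received Z'.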


6.6  Conclusions  and  Recommendations  for 
Future  Work 

Our prototype implementation of the TRANS broadcast protocol was
undertaken very near the end of the contract with only very limited time and
money available. Unfortunately, it did not prove possible to implement the
protocol, solve its outstanding problems, and undertake an extensive
performance evaluation within the resources available. We chose to
concentrate on completing the prototype implementation, identifying problems,
and obtaining preliminary performance measurements.

Our initial measurements on the performance of TRANS have been
favorable. Its storage requirements, network bandwidth usage, and reception
detection latency were all found to be quite low. It appears that TRANS
would make an excellent foundation for a variety of protocols and
distributed systems algorithms for broadcast environments.

Our prototype implementation of TRANS provides an excellent test-bed
environment for further work on this and related broadcast protocols. Some
recommendations for future work that could build on our achievements
so far are presented below.

6.6.1  Corrections  and  Formal  Specification 

A great deal was learned about the behavior of TRANS during this
investigation and several subtle problems were uncovered. We do not consider
the remaining problems to pose significant difficulties, but simply had
insufficient time to resolve them. Further investigation and development of
TRANS must begin with the correction of the problems already discovered.
Because of the subtlety of the issues involved, it is important for a corrected
version of TRANS to be specified formally and completely, and subject to
formal analysis and proof. There are at least three properties of TRANS
that should be specified and formally verified:

Liveness of Transmission Control: we need to be sure that a message
will be rebroadcast until all hosts have received it.

Safety of Reception Analysis: we need to be sure that when the
reception analysis algorithm declares that a particular host has seen a
particular message, then indeed it has seen that message.

Liveness of Reception Analysis: we need to be sure that if a host has
received a message, then eventually the reception analysis algorithm
will enable the sender of the message to discover that fact.
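
As a starting point for such a specification, the three properties might be written in linear temporal logic roughly as follows. This is only a sketch under stated assumptions: the predicates recv, rebcast, and decl are hypothetical names introduced here, m ranges over messages that have been broadcast, h ranges over the hosts other than the sender, and fairness assumptions on the network are left unstated. The symbols require the amsmath and amssymb packages.

```latex
% Assumed predicates (not taken from the report):
%   recv(h, m)   host h has received message m
%   rebcast(m)   the sender of m rebroadcasts m
%   decl(h, m)   reception analysis at the sender declares that h has seen m
\begin{align*}
\text{Liveness of transmission control:} \quad
  & \Box\bigl( (\exists h.\ \neg\mathit{recv}(h,m))
      \Rightarrow \Diamond\,\mathit{rebcast}(m) \bigr) \\
\text{Safety of reception analysis:} \quad
  & \Box\bigl( \mathit{decl}(h,m) \Rightarrow \mathit{recv}(h,m) \bigr) \\
\text{Liveness of reception analysis:} \quad
  & \Box\bigl( \mathit{recv}(h,m) \Rightarrow \Diamond\,\mathit{decl}(h,m) \bigr)
\end{align*}
```

A complete specification would also have to distinguish the individual versions of a message, which is where the problems described in Section 6.5 arise.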

The proof given for Theorem 2 in Part III of this report is a proof of
the second of these properties. Although valid, it is deficient in that it
is conservative: it does not address the issues surrounding retransmissions
and multiple versions. A complete formal analysis of TRANS would be an
extremely challenging and worthwhile exercise, since the protocol is one of
the most subtle distributed algorithms we have encountered.

6.6.2  Additional  Performance  Measurements 

In the time available, we were able to perform only very limited performance
measurements on our implementation of TRANS. Additional measurement
and performance evaluation is clearly necessary in order to determine the
practical utility and characteristics of TRANS. Among the performance
properties that should be investigated we suggest the following as
particularly relevant:

• Compare TRANS against point-to-point protocols and existing protocols
for broadcast communications.

• Vary more parameters, and vary parameters over a wider range.
Parameters include message timeout, no-message timeout, error rates,
number of hosts, and their communication patterns (e.g., equally
active hosts, one major sender with the rest mainly passive). Observe
the change in performance characteristics of the protocol as these
parameters change, and check that it is robust.
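
A parameter sweep of the kind suggested in the second item could be
organized as a simple grid. The sketch below is only an illustration: the
parameter names and every value in GRID are hypothetical placeholders, not
settings or results from our measurements, and the code only enumerates
configurations; it does not run the protocol.

```python
from itertools import product

# Hypothetical values chosen only to illustrate a sweep; none come from the report.
GRID = {
    "message_timeout_ms": [10, 50, 250],
    "no_message_timeout_ms": [100, 500],
    "error_rate": [0.0, 0.01, 0.1],
    "num_hosts": [3, 10, 30],
    "pattern": ["equally_active", "one_major_sender"],
}


def configurations(grid):
    """Yield one dict per combination of parameter values in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


configs = list(configurations(GRID))
print(len(configs), "configurations to measure")
```

Each configuration would then be handed to whatever measurement harness is
available, and the resulting metrics compared across the grid to check
robustness.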

6.6.3  Performance  Improvements 

Although TRANS performed well in the initial measurements, there are
several alternative versions that appear to offer better performance or different
trade-offs. There are many ways to measure a protocol, such as the number
of packets sent, the total number of bits sent, the percentage of overhead
information sent, the amount of state that must be kept, the amount of
processing required, the time until all hosts have received a message, the
time until a sender realizes that all hosts have received one of its messages,
etc. There are trade-offs between these properties that should be evaluated
for different environments and requirements.

TRANS is optimized to reduce network traffic, especially the number of
acknowledgments that need to be sent. This is accomplished by establishing
acknowledgment chains and reducing the number of explicit
acknowledgments to a minimum. The problem is that these acknowledgment chains
become very tenuous and contain very little redundant information. As the
error rate increases, it can be very difficult for a host to determine that other
hosts have received particular messages. This can cause so many
retransmissions that the initial network traffic savings will be negated. TRANS
tends to favor low network traffic over low latency, storage, and
processing times. As the error rate increases, there is the danger of significant
increases in storage, processing, and reception detection latency times. In
general, it appears that minimizing the number of acknowledgments is not
always the best choice.

Small changes in the protocol can address these problems. A receiver
always knows exactly what it has seen, whereas it may be hard for a sender or
a third party to determine whether that receiver has received a particular
message. The TRANS reception analysis algorithm must err on the side
of caution: it must decline to conclude that a message has been received if
there is any doubt that it has. We believe that considerable improvements
in some performance characteristics can be obtained by having receivers
send a few redundant acks and nacks in order to resolve ambiguities in
a sender's reception analysis. For example, a direct ack can always be
believed, even if other acknowledgment paths to the same message contain
nacks. Thus a host that has received a message that it sees others nacking
can unambiguously affirm its reception of the message by appending its ack
directly to one of its own messages, rather than relying on transitivity.
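
The effect of a redundant direct ack can be illustrated with a deliberately
simplified model. The sketch below is not the TRANS reception analysis
algorithm specified in Part III: the Message type, the has_received function,
and the rule that any nack met along an ack chain makes the chain untrustworthy
are all assumptions made for illustration, and retransmissions and multiple
versions are ignored.

```python
from dataclasses import dataclass


@dataclass
class Message:
    msg_id: str
    sender: str
    acks: frozenset = frozenset()   # ids this message positively acknowledges
    nacks: frozenset = frozenset()  # ids this message negatively acknowledges


def _transitive_ack(start, target_id, log):
    """True if ack chains from `start` reach the target and none carries a nack."""
    stack, seen, reached = [start], set(), False
    while stack:
        m = stack.pop()
        if m.msg_id in seen:
            continue
        seen.add(m.msg_id)
        if target_id in m.nacks:
            return False  # any doubt: refuse to conclude reception
        if target_id in m.acks:
            reached = True
            continue
        stack.extend(log[a] for a in m.acks if a in log)
    return reached


def has_received(host, target_id, log):
    """Conservative guess, from the observer's log, that `host` has the target."""
    target = log.get(target_id)
    if target is not None and target.sender == host:
        return True  # the originator trivially has its own message
    own = [m for m in log.values() if m.sender == host]
    if any(target_id in m.acks for m in own):
        return True  # a direct ack is always believed
    return any(_transitive_ack(m, target_id, log) for m in own)


# Hypothetical traffic: B nacks A1, and C's only message acks B1.
messages = [
    Message("A1", "A"),
    Message("B1", "B", nacks=frozenset({"A1"})),
    Message("C1", "C", acks=frozenset({"B1"})),
    Message("D1", "D", acks=frozenset({"A1"})),
    Message("E1", "E", acks=frozenset({"D1"})),
]
log = {m.msg_id: m for m in messages}
```

In this model the observer cannot conclude that C holds A1, because the only
chain from C passes through B's nack. Once C appends a direct ack of A1 to one
of its own messages, the ambiguity disappears, while E is still credited with
A1 purely by transitivity through D1.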

6.6.4  Extensions  to  Functionality 

Our implementation of TRANS provides for broadcast communication
using a physically broadcast medium. Extensions to the functionality of the
protocol could include multicast and the extension to bridged collections of
broadcast networks where only a subset of hosts sees each broadcast.

6.6.5  Use  of  Broadcast  Communications  in  Distributed 
Algorithms 

Mutual exclusion, locking, synchronization, and distributed database
update algorithms provide good examples of applications in which broadcast
communication can provide substantial benefits.


Consider, for example, a tactical environment comprising multiple
sensors, databases, and actuators using broadcast communications. As each
sensor broadcasts its readings, those databases that hear the broadcast will
update their records. When an actuator subsequently broadcasts a request
for a value, any database that hears the request can broadcast a value in
reply. Other databases will see this request-response and can use it to
update their own records of the value concerned (if they missed the latest
sensor broadcast), or can override it with a broadcast of their own if they
see that the first response produced an obsolete value. In this way,
replication and consistency of the databases are achieved in a very robust manner,
with very little message overhead, and very little explicit coordination. Of
course, the metric that determines which values are more desirable need
not be based simply on time (where newer values are preferred), but could
consider accuracy or other properties of the data.
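
The scenario above can be sketched in a few lines. This is a minimal
illustration under stated assumptions, not part of our implementation: the
Reading and Replica types, the method names, and the timestamp-based
is_better metric are all hypothetical, and the broadcast medium is simulated
by calling the handlers directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    key: str
    value: float
    timestamp: float


def is_better(new, old):
    """Preference metric (an assumption): newer readings win."""
    return old is None or new.timestamp > old.timestamp


class Replica:
    """One database that listens to the broadcast medium."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def on_sensor_broadcast(self, reading):
        if is_better(reading, self.records.get(reading.key)):
            self.records[reading.key] = reading

    def on_request(self, key):
        """Reply to an actuator request with the current record, if any."""
        return self.records.get(key)

    def on_response(self, reading):
        """React to another replica's reply; return an override to broadcast, or None."""
        mine = self.records.get(reading.key)
        if is_better(reading, mine):
            self.records[reading.key] = reading  # we missed an update: adopt it
            return None
        if mine is not None and is_better(mine, reading):
            return mine  # the reply was obsolete: broadcast our better value
        return None


# Hypothetical run: r1 misses the second sensor broadcast.
old = Reading("sensor-1", 20.0, timestamp=1.0)
new = Reading("sensor-1", 25.0, timestamp=2.0)
r1, r2 = Replica("r1"), Replica("r2")
for replica in (r1, r2):
    replica.on_sensor_broadcast(old)
r2.on_sensor_broadcast(new)

reply = r1.on_request("sensor-1")   # r1 answers the actuator with its stale value
override = r2.on_response(reply)    # r2 sees the stale reply and overrides it
if override is not None:
    r1.on_response(override)        # r1 hears the override and catches up
```

Replacing is_better with a metric based on accuracy or another property of
the data changes which value prevails without changing the message pattern.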

6.6.6  Concluding  Remarks 

As far as we are aware, TRANS is the first protocol that exploits the
characteristics of broadcast communications in order to achieve more than simple
broadcasts. TRANS uses the fact that all parties see the traffic of all others
to significantly reduce the number of explicit acknowledgments that are
required. The price paid is in the latency of confirmed message reception,
in the complexity of the reception analysis algorithm, and in the space
required to store information needed by that algorithm. Our prototype
implementation indicates that these costs are not excessive and that protocols
of this kind should be viable in practice.

Our implementation revealed some problems with the protocol as it
stands at present. In the time available we were unable to correct all the
problems encountered and were also unable to collect all the performance
data required for a full evaluation. Given our prototype implementation,
it would require relatively little additional work to remedy the outstanding
problems and perform a substantial performance evaluation. We have
identified simple modifications to the TRANS protocol that could significantly
improve some of its performance characteristics and we have identified
situations (such as the management of replicated databases) in which the use
of broadcast communications could support new algorithms for the
coordination of distributed systems. Our prototype implementation of TRANS
provides an excellent test-bed environment for further work on this and
related broadcast protocols and algorithms.

Rome Air Development Center

RADC plans and executes research, development, test and
selected acquisition programs in support of Command, Control,
Communications and Intelligence (C³I) activities. Technical and
engineering support within areas of competence is provided to
ESD Program Offices (POs) and other ESD elements to
perform effective acquisition of C³I systems. The areas of
technical competence include communications, command and
control, battle management, information processing, surveillance
sensors, intelligence data collection and handling, solid state
sciences, electromagnetics, and propagation, and electronic
reliability/maintainability and compatibility.